SEQUENCING BY HYBRIDIZATION OF A TARGET NUCLEIC ACID
TO A MATRIX OF DEFINED OLIGONUCLEOTIDES

BACKGROUND OF THE INVENTION 
The present invention relates to the sequencing, 
fingerprinting, and mapping of polymers, particularly 
biological polymers. The inventions may be applied, for
example, in the sequencing, fingerprinting, or mapping of
polynucleotides.

The relationship between structure and function of 
macromolecules is of fundamental importance in the 
understanding of biological systems. These relationships are
important to understanding, for example, the functions of 
enzymes, structural proteins, and signalling proteins, ways in 
which cells communicate with each other, as well as mechanisms 
of cellular control and metabolic feedback. 
Genetic information is critical to the continuation of
life processes. Life is substantially informationally based
and its genetic content controls the growth and reproduction of
the organism and its complements. Polypeptides, which are
critical features of all living systems, are encoded by the
genetic material of the cell. In particular, the properties of
enzymes, functional proteins, and structural proteins are
determined by the sequence of amino acids which make them up.
As structure and function are integrally related, many
biological functions may be explained by elucidating the
underlying structural features which provide those
functions. For this reason, it has become very important to
determine the genetic sequences of nucleotides which encode the
enzymes, structural proteins, and other effectors of biological
functions. In addition to segments of nucleotides which encode
polypeptides, there are many nucleotide sequences which are
involved in control and regulation of gene expression.

The human genome project is directed toward
determining the complete sequence of the genome of the human
organism. Although such a sequence would not correspond to the
sequence of any specific individual, it would provide
significant information as to the general organization and
specific sequences contained within segments from particular
individuals. It would also provide mapping information which
is very useful for further detailed studies. However, the need
for highly rapid, accurate, and inexpensive sequencing
technology is nowhere more apparent than in a demanding
sequencing project such as this. To complete the sequencing of
a human genome would require the determination of approximately
3 x 10^9, or 3 billion, base pairs.

The procedures typically used today for sequencing
include the Sanger dideoxy method, see, e.g., Sanger et al.
(1977) Proc. Natl. Acad. Sci. USA, 74:5463-5467, or the Maxam
and Gilbert method, see, e.g., Maxam et al. (1980) Methods in
Enzymology, 65:499-559. The Sanger method utilizes enzymatic
elongation procedures with chain terminating nucleotides. The
Maxam and Gilbert method uses chemical reactions exhibiting
specificity of reaction to generate nucleotide specific
cleavages. Both methods require a practitioner to perform a
large number of complex manual manipulations. These
manipulations usually require isolating homogeneous DNA
fragments, elaborate and tedious sample preparation,
preparing a separating gel, applying samples to the gel,
electrophoresing the samples into this gel, working up the
finished gel, and analyzing the results of the procedure.

Thus, a less expensive, highly reliable, and labor
efficient means for sequencing biological macromolecules is
needed. A substantial reduction in cost and increase in speed
of nucleotide sequencing would be very much welcomed. In
particular, an automated system would improve the
reproducibility and accuracy of procedures. The present
invention satisfies these and other needs.

SUMMARY OF THE INVENTION 
The present invention provides improved methods
useful for de novo sequencing of an unknown polymer sequence,
for verification of known sequences, for fingerprinting
polymers, and for mapping homologous segments within a
sequence. By reducing the number of manual manipulations
required and automating most of the steps, the speed, accuracy,
and reliability of these procedures are greatly enhanced.

A substrate having a matrix of positionally defined
regions with attached reagents exhibiting known recognition
specificity can be produced and used for the sequence
analysis of a polymer. Although most directly applicable to
sequencing, the present invention is also applicable to
fingerprinting, mapping, and general screening of specific
interactions.

The present invention also provides a means to 
automate sequencing manipulations. The automation of the 
substrate production method and of the scan and analysis steps 
minimizes the need for human intervention. This simplifies the 
tasks and promotes reproducibility.

The present invention provides a composition
comprising a plurality of positionally distinguishable sequence
specific reagents attached to a solid substrate, which reagents
are capable of specifically binding to a predetermined subunit
sequence of a preselected multi-subunit length having at least
three subunits, said reagents representing substantially all
possible sequences of said preselected length. In some
embodiments, the subunit sequence is a polynucleotide sequence.
In other embodiments, the specific reagent is an
oligonucleotide of at least about five nucleotides, preferably
at least eight nucleotides, more preferably at least 12
nucleotides. Usually the specific reagents are all attached to
a single solid substrate, and the reagents comprise at least
3000 different sequences. In other embodiments, the reagents
represent at least about 25% of the possible subsequences of
said preselected length. Usually, the reagents are localized
in regions of the substrate having a density of at least 25
regions per square centimeter, and often the substrate has a
surface area of less than about 4 square centimeters.
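
By way of illustration only, the size of a probe set representing all possible sequences of a preselected length can be sketched as follows. The sketch assumes a four-letter nucleotide alphabet; the names `NUCLEOTIDES`, `probe_count`, and `all_probes` are hypothetical and are not taken from the specification.

```python
from itertools import product

NUCLEOTIDES = "ACGT"  # assumption: four-letter DNA alphabet


def probe_count(length):
    """Number of distinct oligonucleotides of the given length."""
    return len(NUCLEOTIDES) ** length


def all_probes(length):
    """Enumerate every oligonucleotide of the given length."""
    return ["".join(p) for p in product(NUCLEOTIDES, repeat=length)]


# Probe-set sizes for the lengths named in the text (5, 8, and 12).
for n in (5, 8, 12):
    print(n, probe_count(n))
```

Under this assumption a complete set of 5-mers has 1,024 members and a complete set of 6-mers has 4,096 members, so a complete 6-mer set is the shortest complete set that exceeds 3000 different sequences.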
The present invention also provides methods for
analyzing a sequence of a polynucleotide, said method
comprising the step of:

    a) exposing said polynucleotide to a
       composition as described.

It also provides useful methods for identifying or
comparing a target sequence with a reference (i.e.,
fingerprinting), said method comprising the steps of:

    a) exposing said target sequence to a
       composition as described;

    b) determining the pattern of positions of the
       reagents which specifically interact with
       the target sequence; and

    c) comparing the pattern with the pattern
       exhibited by the reference when exposed to
       the composition.

By way of example and not limitation, such fingerprinting 
methods may be used for personal identification, genetic
screening, identification of pathological conditions, 
determination of patterns of specific gene expression, and 
others. 
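
Merely by way of example, the comparison of a target pattern with a reference pattern may be sketched as below. This is an idealized model, not the disclosed apparatus: a probe is treated as interacting whenever its sequence, written in the same sense as the target for simplicity, occurs in the target, and the Jaccard index used to score agreement is an assumption chosen for illustration. The names `fingerprint` and `similarity` are hypothetical.

```python
def fingerprint(target, probes):
    """Set of probes found as exact subsequences of the target.

    Idealized stand-in for the pattern of positions that interact.
    """
    return {p for p in probes if p in target}


def similarity(pattern_a, pattern_b):
    """Jaccard index of two patterns (1.0 means identical patterns)."""
    if not pattern_a and not pattern_b:
        return 1.0
    return len(pattern_a & pattern_b) / len(pattern_a | pattern_b)


probes = ["ACG", "CGT", "GTA", "TTT", "GGC"]
reference = fingerprint("ACGTAC", probes)
sample = fingerprint("TACGTA", probes)
other = fingerprint("GGCTTT", probes)

print(similarity(reference, sample))
print(similarity(reference, other))
```

A score near 1 suggests that the sample and reference share the same features; the threshold for calling a match would depend on the application.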

The present invention also provides methods for
sequencing a segment of a polynucleotide comprising the steps
of:

    a) combining:

        i) a substrate comprising a plurality of
           chemically synthesized and positionally
           distinguishable oligonucleotides capable of
           recognizing defined oligonucleotide
           sequences; and

        ii) a target polynucleotide; thereby forming
            high fidelity matched duplex structures of
            complementary subsequences of known
            sequence; and

    b) determining which of said reagents have
       specifically interacted with subsequences in
       said target polynucleotide.

In one embodiment, the segment is substantially the
entire length of said polynucleotide.

In one embodiment, the substrates are beads.
Preferably, the plurality of reagents comprise substantially
all possible subsequences of said preselected length found in
said target. In another embodiment, the solid phase substrate
is a single substrate having attached thereto reagents
recognizing substantially all possible subsequences of
preselected length found in said target.

In another embodiment, the method further comprises
the step of analyzing a plurality of said recognized
subsequences to assemble a sequence of said target polymer. In
a bead embodiment, at least some of the plurality of substrates
have one subsequence specific reagent attached thereto, and the
substrates are coded to indicate the sequence specificity of
said reagent.

The present invention also embraces a method of using
a fluorescent nucleotide to detect interactions with
oligonucleotide probes of known sequence, said method
comprising:

    a) attaching said nucleotide to a target unknown
       polynucleotide sequence, and

    b) exposing said target polynucleotide sequence to
       a collection of positionally defined
       oligonucleotide probes of known sequences to
       determine the sequences of said probes which
       interact with said target.

In a further refinement, an additional step is
included of:

    a) collating said known sequences to determine the
       overlaps of said known sequences to determine
       the sequence of said target sequence.

A method of mapping a plurality of sequences relative
to one another is also provided, the method comprising:

    a) preparing a substrate having a plurality of
       positionally attached sequence specific probes;

    b) exposing each of said sequences to said
       substrate, thereby determining the patterns of
       interaction between said sequence specific
       probes and said sequences; and

    c) determining the relative locations of said
       sequence specific probe interactions on said
       sequences to determine the overlaps and order of
       said sequences.

In one refinement, the sequence specific probes are
oligonucleotides, applicable to where the target sequences are
nucleic acid sequences.
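
By way of illustration only, the ordering step of such a mapping method may be sketched as follows. The sketch makes a strong simplifying assumption that is not part of the disclosure: the sequences form a simple linear chain in which only neighboring sequences share probe interactions. The names `order_sequences` and `min_shared` are hypothetical.

```python
def order_sequences(hits, min_shared=1):
    """Order sequences along a chain from their probe interaction sets.

    hits maps a sequence name to the set of probes it interacted with.
    Assumption: only adjacent sequences share at least min_shared probes.
    """
    neighbors = {
        a: [b for b in hits if b != a and len(hits[a] & hits[b]) >= min_shared]
        for a in hits
    }
    # An end of the chain has exactly one overlapping neighbor.
    ends = sorted(name for name, ns in neighbors.items() if len(ns) == 1)
    if not ends:
        raise ValueError("no chain end found; sequences do not form a chain")
    order = [ends[0]]
    while True:
        unvisited = [n for n in neighbors[order[-1]] if n not in order]
        if not unvisited:
            break
        order.append(unvisited[0])
    return order


hits = {
    "clone3": {"p3", "p4"},
    "clone1": {"p1", "p2"},
    "clone2": {"p2", "p3"},
}
print(order_sequences(hits))
```

Real data would contain repeated and missing interactions, so a practical procedure would need to tolerate ambiguity rather than raise an error.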

In the nucleic acid sequencing application, the steps
of the sequencing process comprise:

    a) producing a matrix substrate having known
       positionally defined regions of known sequence
       specific oligonucleotide probes;

    b) hybridizing a target polynucleotide to the
       positions on the matrix so that each of the
       positions which contain oligonucleotide probes
       complementary to a sequence on the target
       hybridizes to the target molecule;

    c) detecting which positions have bound the target,
       thereby determining sequences which are found on
       the target; and

    d) analyzing the known sequences contained in the
       target to determine sequence overlaps and
       assembling the sequence of the target therefrom.

The enablement of the sequencing process by
hybridization is based in large part upon the ability to
synthesize a large number of the possible overlapping sequence
segments (e.g., to virtually saturate them), to distinguish those
probes which hybridize with fidelity from those which have
mismatched bases, and to analyze a highly complex pattern of
hybridization results to determine the overlap regions.

The detecting of the positions which bind the target
sequence would typically be through a fluorescent label on the
target. Although a fluorescent label is probably most
convenient, other sorts of labels, e.g., radioactive, enzyme
linked, optically detectable, or spectroscopic labels may be
used. Because the oligonucleotide probes are positionally
defined, the location of the hybridized duplex will directly
translate to the sequences which hybridize. Thus, analysis
of the positions provides a collection of subsequences
found within the target sequence. These subsequences are
matched with respect to their overlaps so as to assemble an
intact target sequence.

BRIEF DESCRIPTION OF THE FIGURES 
Fig. 1 illustrates a flow chart for sequence,
fingerprint, or mapping analysis.

Fig. 2 illustrates the proper function of a VLSIPS
nucleotide synthesis.

Fig. 3 illustrates the proper function of a VLSIPS
dinucleotide synthesis.

Fig. 4 illustrates the process of a VLSIPS
trinucleotide synthesis.

DESCRIPTION OF THE PREFERRED EMBODIMENTS 

I. Overall Description
    A. general
    B. VLSIPS substrates
    C. binary masking
    D. applications
    E. detection methods and apparatus
    F. data analysis

II. Theoretical Analysis
    A. simple n-mer structure; theory
    B. complications

III. Polynucleotide Sequencing
    A. preparation of substrate matrix
    B. labeling target polynucleotide
    C. hybridization conditions
    D. detection; VLSIPS scanning
    E. analysis
    F. substrate reuse

IV. Fingerprinting
    A. general
    B. preparation of substrate matrix
    C. labeling target nucleotides
    D. hybridization conditions
    E. detection; VLSIPS scanning
    F. analysis
    G. substrate reuse
    H. other polynucleotide aspects

V. Mapping
    A. general
    B. preparation of substrate matrix
    C. labeling
    D. hybridization/specific interaction
    E. detection
    F. analysis
    G. substrate reuse

VI. Additional Screening
    A. specific interactions
    B. sequence comparisons
    C. categorizations
    D. statistical correlations

VII. Formation of Substrate
    A. instrumentation
    B. binary masking
    C. synthetic methods
    D. surface immobilization

VIII. Hybridization/Specific Interaction
    A. general
    B. important parameters

IX. Detection Methods
    A. labeling techniques
    B. scanning system

X. Data Analysis
    A. general
    B. hardware
    C. software

XI. Substrate Reuse
    A. removal of label
    B. storage and preservation
    C. processes to avoid degradation of oligomers

XII. Integrated Sequencing Strategy
    A. initial mapping strategy
    B. selection of smaller clones

XIII. Commercial Applications
    A. sequencing
    B. fingerprinting
    C. mapping

* * *

I. OVERALL DESCRIPTION
A. General

The present invention relies in part on the ability
to synthesize or attach specific recognition reagents at known
locations on a substrate, typically a single substrate. In
particular, the present invention provides the ability to
prepare a substrate having a very high density matrix pattern
of positionally defined specific recognition reagents. The
reagents are capable of interacting with their specific targets
while attached to the substrate, e.g., solid phase
interactions, and by appropriate labeling of these targets, the
sites of the interactions between the target and the specific
reagents may be derived. Because the reagents are positionally
defined, the sites of the interactions will define the
specificity of each interaction. As a result, a map of the
patterns of interactions with specific reagents on the
substrate is convertible into information on the specific
interactions taking place, e.g., the recognized features.
Where the specific reagents recognize a large number of
possible features, this system allows the determination of the
combination of specific interactions which exist on the target
molecule. Where the number of features is sufficiently large,
the same combination, or pattern, of features is
sufficiently unlikely that a particular target molecule may
often be uniquely defined by its features. In the extreme, the
features may actually be the subunit sequence of the target
molecule, and a given target sequence may be uniquely defined
by its combination of features.
In particular, the methodology is applicable to
sequencing polynucleotides. The specific sequence recognition
reagents will typically be oligonucleotide probes which
hybridize with specificity to subsequences found on the target
sequence. A sufficiently large number of those probes allows
the fingerprinting of a target polynucleotide or the relative
mapping of a collection of target polynucleotides, as described
in greater detail below.

In the high resolution fingerprinting provided by a
saturating collection of probes which include all possible
subsequences of a given size, e.g., 10-mers, all the
subsequences can be collated and their specific overlaps
determined, and the entire sequence can usually be reconstructed.

Sequence analysis may take the form of complete
sequence determination, to the level of the sequence of
individual subunits along the entire length of the target
sequence. Sequence analysis also may take the form of sequence
homology, e.g., less than absolute subunit resolution, where
"similarity" in the sequence will be detectable, or the form of
selective sequences of homology interspersed at specific or
irregular locations.

In either case, the sequence is determinable at 
selective resolution or at particular locations. Thus, the 
hybridization method will be useful as a means for 
identification, e.g., a "fingerprint", much like a Southern 
hybridization method is used. It is also useful to map 
particular target sequences. 

B. VLSIPS Substrates 

The invention is enabled by the development of 
technology to prepare substrates on which specific reagents may 
be either positionally attached or synthesized. In particular, 
the very large scale immobilized polymer synthesis (VLSIPS) 
technology allows for the very high density production of an 
enormous diversity of reagents mapped out in a known matrix 
pattern on a substrate. These reagents specifically recognize 
subsequences in a target polymer and bind thereto, producing a 
map of positionally defined regions of interaction. These map
positions are convertible into the actual features recognized,
which are thus shown to be present in the target molecule of interest.

As indicated, the sequence specific recognition 
reagents will often be oligonucleotides which hybridize with 
fidelity and discrimination to the target sequence. 

In the generic sense, the VLSIPS technology allows
the production of a substrate with a high density matrix of
positionally mapped regions with specific recognition reagents
attached at each distinct region. By use of protective groups
which can be positionally removed, or added, the regions can be
activated or deactivated for addition of particular reagents or
compounds. Details of the protection are described below and
in PCT publication no. WO90/15070, published December 13, 1990.
In a preferred embodiment, photosensitive protecting agents
will be used and the regions of activation or deactivation may
be controlled by electro-optical and optical methods, similar
to many of the processes used in semiconductor wafer and chip
fabrication.

In the nucleic acid sequencing
application, a VLSIPS substrate is synthesized having
positionally defined oligonucleotide probes. See PCT
publication no. WO90/15070, published December 13, 1990; and
U.S.S.N. 07/624,120, filed December 6, 1990. By use of masking
technology and photosensitive synthetic subunits, the VLSIPS
apparatus allows for the stepwise synthesis of polymers
according to a positionally defined matrix pattern. Each
oligonucleotide probe will be synthesized at known and defined
positional locations on the substrate. This forms a matrix
pattern of known relationship between position and specificity
of interaction. The VLSIPS technology allows a very large
number of different oligonucleotide probes to be
simultaneously and automatically synthesized, including numbers
in excess of about 10^2, 10^3, 10^4, 10^5, 10^6, or even more, and at
densities of at least about 10^2/cm^2, 10^3/cm^2, 10^4/cm^2,
10^5/cm^2 and up to 10^6/cm^2 or more. This application discloses
methods for synthesizing polymers on a silicon or other suitably
derivatized substrate, methods and chemistry for synthesizing
specific types of biological polymers on those substrates,
apparatus for scanning and detecting whether interaction has
occurred at specific locations on the substrate, and various
other technologies related to the use of a high density very
large scale immobilized polymer substrate. In particular,
sequencing, fingerprinting, and mapping applications are
discussed herein in detail, though related technologies are
described in U.S.S.N. 07/624,120, filed December 6, 1990; and
PCT/US91/02989, filed May 1, 1991, each of which is hereby
incorporated herein by reference.
The regions which define particular reagents will
usually be generated by selective protecting groups which may
be activated or deactivated. Typically the protecting group
will be bound to a monomer subunit or spatial region, and can
be spatially affected by an activator, such as electromagnetic
radiation. Examples of protective groups with utility herein
include nitroveratryl oxycarbonyl (NVOC), nitrobenzyl
oxycarbonyl (NBOC), or α,α-dimethyl-dimethoxybenzyl oxycarbonyl
(DDZ).

C. Binary Masking 

In fact, the means for producing a substrate useful
for these techniques are explained in U.S.S.N. 07/492,462
(VLSIPS CIP), which is hereby incorporated herein by reference.
However, there are various particular ways to optimize the
synthetic processes. Many of these methods are described in
U.S.S.N. 07/624,120.

Briefly, the binary synthesis strategy refers to an
ordered strategy for parallel synthesis of diverse polymer
sequences by sequential addition of reagents which may be
represented by a reactant matrix and a switch matrix, the
product of which is a product matrix. A reactant matrix is a
1 x n matrix of the building blocks to be added. The switch
matrix is all or a subset of the binary numbers from 1 to n
arranged in columns. In preferred embodiments, a binary
strategy is one in which at least two successive steps
illuminate half of a region of interest on the substrate. In
most preferred embodiments, binary synthesis refers to a
synthesis strategy which also factors a previous addition step.
For example, the switch matrix for a masking strategy may
halve regions that were previously illuminated,
illuminating about half of the previously illuminated region
and protecting the remaining half (while also protecting about
half of previously protected regions and illuminating about
half of previously protected regions). It will be recognized
that binary rounds may be interspersed with non-binary rounds
and that only a portion of a substrate may be subjected to a
binary scheme; such a scheme will still be considered to be a
binary masking scheme within the definition herein. A binary
"masking" strategy is a binary synthesis which uses light to
remove protective groups from materials for addition of other
materials such as nucleotides.

In particular, this procedure provides a simplified
and highly efficient method for saturating all possible
sequences of a defined length polymer. This masking strategy
is also particularly useful in producing all possible
oligonucleotide sequence probes of a given length.
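
By way of illustration only, the relationship between the reactant matrix, the switch matrix, and the product matrix may be sketched as follows. The sketch assumes that each column of the switch matrix is a binary vector stating, for one region of the substrate, which of the n addition steps that region is exposed to; it enumerates every n-bit column, including the all-zero column (a region receiving no building block). The names `switch_matrix` and `product_matrix` are hypothetical.

```python
from itertools import product


def switch_matrix(n):
    """All n-bit binary columns; bit i says whether step i exposes the region."""
    return list(product((0, 1), repeat=n))


def product_matrix(reactants, switches):
    """Combine the 1 x n reactant row with each switch column.

    A building block is appended only where the switch bit is 1, in the
    order in which the blocks are added.
    """
    return ["".join(r for r, bit in zip(reactants, col) if bit)
            for col in switches]


reactants = ["A", "B", "C", "D"]  # placeholder building blocks
switches = switch_matrix(len(reactants))
products = product_matrix(reactants, switches)

# Count the regions exposed at each of the four addition steps.
exposed_per_step = [sum(col[i] for col in switches)
                    for i in range(len(reactants))]
print(len(products), exposed_per_step)
```

In this sketch four addition steps yield sixteen distinct products, and each step exposes exactly half of the sixteen regions, which corresponds to the halving behavior described above.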

D. Applications 

The technology provided by the present invention has
very broad applications. Although described specifically for
polynucleotide sequences, similar sequencing, fingerprinting,
mapping, and screening procedures may be applied to
polypeptide, carbohydrate, or other polymers. This may be for
de novo sequencing, or may be used in conjunction with a second
sequencing procedure to provide independent verification. See,
e.g., (1988) Science 242:1245. For example, a large
polynucleotide sequence defined by either the Maxam and Gilbert
technique or by the Sanger technique may be verified by using
the present invention.

In addition, by selection of appropriate probes, a
polynucleotide sequence can be fingerprinted. Fingerprinting
is a less detailed sequence analysis which usually involves the
characterization of a sequence by a combination of defined
features. Sequence fingerprinting is particularly useful
because the repertoire of possible features which can be tested
is virtually infinite. Moreover, the stringency of matching is
also variable depending upon the application. A Southern Blot
analysis may be characterized as a means of simple fingerprint
analysis.

Fingerprinting analysis may be performed to the
resolution of specific nucleotides, or may be used to determine
homologies, most commonly for large segments. In particular,
an array of oligonucleotide probes of virtually any workable
size may be positionally localized on a matrix and used to
probe a sequence for either absolute complementary matching, or
homology to the desired level of stringency using selected
hybridization conditions.
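
Merely by way of example, the distinction between absolute matching and homology at a reduced level of stringency may be sketched as follows. The sketch models stringency as the number of mismatched positions tolerated, which is an assumption made for illustration; actual stringency is set by the hybridization conditions. Probe and target are written in the same sense for simplicity, and the names `mismatches` and `hit_positions` are hypothetical.

```python
def mismatches(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def hit_positions(target, probe, max_mismatches=0):
    """Start positions where the probe matches within the allowed mismatches."""
    k = len(probe)
    return [i for i in range(len(target) - k + 1)
            if mismatches(target[i:i + k], probe) <= max_mismatches]


target = "ACGTACGA"
print(hit_positions(target, "ACGT"))                    # absolute matching
print(hit_positions(target, "ACGT", max_mismatches=1))  # relaxed stringency
```

With no mismatches allowed only the exact site is reported; allowing one mismatch also reports the related site beginning at position 4.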

In addition, the present invention provides means for
mapping analysis of a target sequence or sequences. Mapping
will usually involve the sequential ordering of a plurality of
various sequences, or may involve the localization of a
particular sequence within a plurality of sequences. This may
be achieved by immobilizing particular large segments onto the
matrix and probing with a shorter sequence to determine which
of the large sequences contain that smaller sequence.
Alternatively, relatively shorter probes of known or random
sequence may be immobilized to the matrix and a map of various
different target sequences may be determined from overlaps.
Principles of such an approach are described in some detail by
Evans et al. (1989) "Physical Mapping of Complex Genomes by
Cosmid Multiplex Analysis," Proc. Natl. Acad. Sci. USA
86:5030-5034; Michiels et al. (1987) "Molecular Approaches to
Genome Analysis: A Strategy for the Construction of Ordered
Overlap Clone Libraries," CABIOS 3:203-210; Olsen et al. (1986)
"Random-Clone Strategy for Genomic Restriction Mapping in
Yeast," Proc. Natl. Acad. Sci. USA 83:7826-7830; Craig et al.
(1990) "Ordering of Cosmid Clones Covering the Herpes Simplex
Virus Type I (HSV-I) Genome: A Test Case for Fingerprinting by
Hybridization," Nuc. Acids Res. 18:2653-2660; and Coulson et
al. (1986) "Toward a Physical Map of the Genome of the Nematode
Caenorhabditis elegans," Proc. Natl. Acad. Sci. USA
83:7821-7825; each of which is hereby incorporated herein by
reference.
Fingerprinting analysis also provides a means of
identification. In addition to its value in apprehension of
criminals from whom a biological sample, e.g., blood, has been
collected, fingerprinting can ensure personal identification
for other reasons. For example, it may be useful for
identification of bodies in tragedies such as fire, flood, and
vehicle crashes. In other cases the identification may be
useful in identification of persons suffering from amnesia, or
of missing persons. Other forensics applications include
establishing the identity of a person, e.g., military
identification "dog tags", or may be used in identifying the
source of particular biological samples. Fingerprinting
technology is described, e.g., in Carrano et al. (1989) "A
High-Resolution, Fluorescence-Based, Semi-automated Method for
DNA Fingerprinting," Genomics 4:129-136, which is hereby
incorporated herein by reference. See, e.g., Table I, for
nucleic acid applications.

TABLE I 

VLSIPS PROJECT IN NUCLEIC ACIDS

I. Construction of Chips

II. Applications
    A. Sequencing
        1. Primary sequencing
        2. Secondary sequencing (sequence checking)
        3. Large scale mapping
        4. Fingerprinting
    B. Duplex/Triplex formation
        1. Antisense
        2. Sequence specific function modulation
           (e.g. promoter inhibition)
    C. Diagnosis
        1. Genetic markers
        2. Type markers
            a. Blood donors
            b. Tissue transplants
    D. Microbiology
        1. Clinical microbiology
        2. Food microbiology

III. Instrumentation
    A. Chip machines
    B. Detection

IV. Software Development
    A. Instrumentation software
    B. Data reduction software
    C. Sequence analysis software

The fingerprinting analysis may be used to perform
various types of genetic screening. For example, a single
substrate may be generated with a plurality of screening
probes, allowing for the simultaneous genetic screening for a
large number of genetic markers. Thus, prenatal or diagnostic
screening can be simplified, economized, and made more
generally accessible.

In addition to the sequencing, fingerprinting, and
mapping applications, the present invention also provides means
for determining specificity of interaction with particular
sequences. Many of these applications were described in
U.S.S.N. 07/362,901 (VLSIPS parent), U.S.S.N. 07/492,462
(VLSIPS CIP), U.S.S.N. 07/435,316 (caged biotin parent), and
U.S.S.N. 07/612,671 (caged biotin CIP).

E. Detection Methods and Apparatus 
An appropriate detection method applicable to the
selected labeling method can be selected. Suitable labels
include radionucleotides, enzymes, substrates, cofactors,
inhibitors, magnetic particles, heavy metal atoms, and
particularly fluorescers, chemiluminescers, and spectroscopic
labels. Patents teaching the use of such labels include U.S.
Patent Nos. 3,817,837; 3,850,752; 3,939,350; 3,996,345;
4,277,437; 4,275,149; and 4,366,241.

With an appropriate label selected, the detection
system best adapted for high resolution and high sensitivity
detection may be selected. As indicated above, an optically
detectable system, e.g., fluorescence or chemiluminescence,
would be preferred. Other detection systems may be adapted to
the purpose, e.g., electron microscopy, scanning electron
microscopy (SEM), scanning tunneling electron microscopy
(STEM), infrared microscopy, atomic force microscopy (AFM),
electrical conductance, and image plate transfer.

With a detection method selected, an apparatus for
scanning the substrate will be designed. Apparatus, as
described in PCT publication no. WO90/15070, published December
13, 1990; or U.S.S.N. 07/624,120, filed December 6, 1990, are
particularly appropriate. Design modifications may also be
incorporated therein.

F. Data Analysis

Data are analyzed by processes similar to those
described below in the section describing theoretical analysis.
More efficient algorithms will be mathematically devised, and
will usually be designed to be performed on a computer.
Various computer programs which may more quickly or efficiently
make measurement samples and distinguish signal from noise will
also be devised. See, particularly, U.S.S.N. 07/624,120.

The initial data resulting from the detection system
is an array of data indicative of fluorescent intensity versus
location on the substrate. The data are typically taken over
regions substantially smaller than the area in which synthesis
of a given polymer has taken place. Merely by way of example,
if polymers were synthesized in squares on the substrate having
dimensions of 500 microns by 500 microns, the data may be taken
over regions having dimensions of 5 microns by 5 microns. In
most preferred embodiments, the regions over which fluorescence
data are taken across the substrate are less than about 1/2 the
area of the regions in which individual polymers are
synthesized, preferably less than 1/10 the area in which a
single polymer is synthesized, and most preferably less than
1/100 the area in which a single polymer is synthesized.
Hence, within any area in which a given polymer has been
synthesized, a large number of fluorescence data points are
collected.

A plot of number of pixels versus intensity for a
scan should bear a rough resemblance to a bell curve, but
spurious data are observed, particularly at higher intensities.
Since it is desirable to use an average of fluorescent
intensity over a given synthesis region in determining relative
binding affinity, these spurious data will tend to undesirably
skew the data.

Accordingly, in one embodiment of the invention the
data are corrected for removal of these spurious data points,
and an average of the data points is thereafter utilized in
determining relative binding efficiency. In general the data
are fitted to a base curve and statistical measures are used
to remove spurious data.
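
The correction described above can be sketched in a few lines.
The text does not specify the base curve or the statistical
measure, so this sketch substitutes a simple median and median
absolute deviation (MAD) filter as a stand-in. The function name,
the cutoff of 3.5, and the sample intensities are assumptions made
only for illustration.

```python
import statistics


def robust_region_mean(pixels, cutoff=3.5):
    """Average the pixel intensities of one synthesis region after
    discarding spurious points.

    Assumption: outliers are flagged with a median/MAD rule, used here
    in place of the unspecified curve fit mentioned in the text.
    """
    med = statistics.median(pixels)
    mad = statistics.median(abs(p - med) for p in pixels)
    # 1.4826 scales the MAD to a standard deviation for bell-shaped data.
    sigma = 1.4826 * mad
    kept = [p for p in pixels if abs(p - med) <= cutoff * sigma]
    return statistics.fmean(kept)


# Hypothetical intensities: eight ordinary pixels and two bright artifacts.
region = [100, 102, 98, 101, 99, 103, 97, 100, 950, 1200]
print(statistics.fmean(region))    # plain average, skewed upward
print(robust_region_mean(region))  # average after outlier removal
```

In this made-up region the two bright pixels pull the plain
average far above the typical intensity, while the filtered
average stays near it.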



WO 92/10588    PCT/US91/09226



In an additional analytical tool, various degeneracy
reducing analogues may be incorporated in the hybridization
probes. Various aspects of this strategy are described, e.g.,
in Macevicz, S. (1990) PCT publication number WO 90/04652,
which is hereby incorporated herein by reference.

II. THEORETICAL ANALYSIS 

The principle of the hybridization sequencing
procedure is based, in part, upon the ability to determine
overlaps of short segments. The VLSIPS technology provides the
ability to generate reagents which will saturate the possible
short subsequence recognition possibilities. The principle is
most easily illustrated by using a binary sequence, such as a
sequence of zeros and ones. Once having illustrated the
application to a binary alphabet, the principle may easily be
understood to encompass three letter, four letter, five or more
letter, even 20 letter alphabets. A theoretical treatment of
analysis of subsequence information to reconstruction of a
target sequence is provided, e.g., in Lysov, Yu., et al. (1988)
Doklady Akademii Nauk SSSR 303:1508-1511; Khrapko, K., et al.
(1989) FEBS Letters 256:118-122; Pevzner, P. (1989) J. of
Biomolecular Structure and Dynamics 7:63-69; and Drmanac, R. et
al. (1989) Genomics 4:114-128; each of which is hereby
incorporated herein by reference.
The reagents for recognizing the subsequences will
usually be specific for recognizing a particular polymer
subsequence anywhere within a target polymer. It is preferable
that conditions may be devised which allow absolute
discrimination between high fidelity matching and very low
levels of mismatching. The reagent interaction will preferably
exhibit no sensitivity to flanking sequences, to the
subsequence position within the target, or to any other remote
structure within the sequence.

A. Simple n-mer Structure: Theory

1. Simple two letter alphabet: example

A simple example is presented below of how a sequence
of ten digits comprising zeros and ones would be sequenceable
using short segments of five digits. For example, consider the
sample ten digit sequence: 

1010011100. 

A VLSIPS substrate could be constructed, as discussed
elsewhere, which would have reagents attached in a defined
matrix pattern which specifically recognize each of the
possible five digit sequences of ones and zeros. The number of
possible five digit subsequences is 2^5 = 32. The number of
possible different sequences 10 digits long is 2^10 = 1,024. The
five contiguous digit subsequences within a ten digit sequence
number six, i.e., positioned at digits 1-5, 2-6, 3-7, 4-8, 5-9,
and 6-10. It will be noted that the specific order of the
digits in the sequence is important and that the order is
directional, e.g., running left to right versus right to left.
The first five digit sequence contained in the target sequence
is 10100. The second is 01001, the third is 10011, the fourth
is 00111, the fifth is 01110, and the sixth is 11100.

The VLSIPS substrate would have a matrix pattern of
positionally attached reagents which recognize each of the
different 5-mer subsequences. Those reagents which recognize
each of the 6 contained 5-mers will bind the target, and a
label allows the positional determination of where the sequence
specific interaction has occurred. By correlation of the
position in the matrix pattern, the corresponding bound
subsequences can be determined.

In the above-mentioned sequence, six different 5-mer
sequences would be determined to be present. They would be:

10100
01001
10011
00111
01110
11100
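
The six subsequences listed above can be generated mechanically
by sliding a five digit window along the target. The following is
a minimal sketch; the function name kmers is an illustrative
choice and not a term used in this specification.

```python
def kmers(sequence, k):
    """Return every contiguous k-symbol subsequence, left to right."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


target = "1010011100"
for position, probe in enumerate(kmers(target, 5), start=1):
    # position is the 1-based digit at which the 5-mer begins
    print(position, probe)
```

A ten digit target yields 10 - 5 + 1 = 6 windows, matching the
positions 1-5 through 6-10 given in the text.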

Any sequence which contains the first five digit
sequence, 10100, already narrows the number of possible
sequences (e.g., from 1024 possible sequences) which contain it
to less than about 192 possible sequences.

This 192 is derived from the observation that with
the subsequence 10100 at the far left of the sequence, in
positions 1-5, there are only 32 possible sequences. Likewise,
for that particular subsequence in positions 2-6, 3-7, 4-8,
5-9, and 6-10. So, to sum up all of the sequences that could
contain 10100, there are 32 for each position and 6 positions
for a total of about 192 possible sequences. However, some of
these 10 digit sequences will have been counted twice. Thus,
by virtue of containing the 10100 subsequence, the number of
possible 10-mer sequences has been decreased from 1024
sequences to less than about 192 sequences.
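
The bound of 192 can be checked against an exhaustive count. The
sketch below is only an illustration of the arithmetic; the
function name count_containing is an assumption. It enumerates all
1024 ten digit binary sequences and counts those that contain
10100 at least once.

```python
from itertools import product


def count_containing(probe, length, alphabet="01"):
    """Count sequences of the given length that contain the probe."""
    return sum(probe in "".join(s) for s in product(alphabet, repeat=length))


# Upper bound from the text: 6 positions times 2^5 free digits each.
upper_bound = (10 - 5 + 1) * 2 ** (10 - 5)
exact = count_containing("10100", 10)
print(upper_bound, exact)
```

The exhaustive count comes to 191, one less than the bound,
because 1010010100 contains the probe at two positions and is
therefore counted twice in the sum of 192.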

In this example, not only do we know that the sequence
contains 10100, but we also know that it contains the second
five character sequence, 01001. By virtue of knowing that the
sequence contains 10100, we can look specifically to determine
whether the sequence contains a subsequence of five characters
which contains the four leftmost digits plus a next digit to
the left. For example, we would look for a sequence of X1010,
but we find that there is none. Thus, we know that the 10100
must be at the left end of the 10-mer. We would also look to
see whether the sequence contains the rightmost four digits
plus a next digit to the right, e.g., 0100X. We find that the
sequence also contains the sequence 01001, and that X is a 1.
Thus, we know at least that our target sequence has an overlap
of 0100 and has the left terminal sequence 101001.

Applying the same procedure to the second 5-mer, we
also know that the sequence must include a sequence of five
digits having the sequence 1001Y where Y must be either 0 or 1.
We look through the fragments and we see that we have a 10011
sequence within our target, thus Y is also 1. Thus, we would
know that our sequence has a sequence of the first seven being
1010011.

Moving to the next 5-mer, we know that there must be
a sequence of 0011Z, where Z must be either 0 or 1. We look at
the fragments produced above and see that the target sequence
contains a 00111 subsequence and Z is 1. Thus, we know the
sequence must start with 10100111.

The next 5-mer must be of the sequence 0111W where W
must be 0 or 1. Again, looking up at the fragments produced,
we see that the target sequence contains a 01110 subsequence,
and W is a 0. Thus, our sequence to this point is 101001110.
We know that the last 5-mer must be either 11100 or 11101.
Looking above, we see that it is 11100 and that must be the
last of our sequence. Thus, we have determined that our
sequence must have been 1010011100.
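
The walk just described, from the left terminal 5-mer rightward
one digit at a time, can be sketched as follows. This is a
simplified illustration under stated assumptions, not the analysis
program of Figure 4: it assumes every probe signal is correct, that
each 5-mer occurs once in the target, and that each extension step
has exactly one candidate. The function name reconstruct is
hypothetical.

```python
def reconstruct(probes, k):
    """Rebuild a target from its k-mers by (k-1)-symbol overlaps.

    Assumes error-free data and a single unambiguous path.
    """
    probes = set(probes)
    # The left terminal probe is the one whose first k-1 symbols are
    # not the last k-1 symbols of any probe (no X1010 in the example).
    suffixes = {p[1:] for p in probes}
    starts = [p for p in probes if p[:k - 1] not in suffixes]
    if len(starts) != 1:
        raise ValueError("no unique starting probe")
    sequence = starts[0]
    unused = probes - {sequence}
    while unused:
        tail = sequence[-(k - 1):]
        candidates = [p for p in unused if p[:k - 1] == tail]
        if len(candidates) != 1:
            raise ValueError("extension is ambiguous or incomplete")
        sequence += candidates[0][-1]  # append the one new digit
        unused.remove(candidates[0])
    return sequence


observed = ["00111", "11100", "10100", "01110", "10011", "01001"]
print(reconstruct(observed, 5))
```

The probes are deliberately supplied out of order, since the
substrate reports only which positions are bound, not where in
the target each subsequence lies.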
However, it will be recognized from the example above
with the sequences provided therein, that the sequence analysis
can start with any known positive probe subsequence. The
determination may be performed by moving linearly along the
sequence checking the known sequence with a limited number of
next positions. Given this possibility, the sequence may be
determined, besides by scanning all possible oligonucleotide
probe positions, by specifically looking only where the next
possible positions would be. This may increase the complexity
of the scanning but may provide a longer time span dedicated
towards scanning and detecting specific positions of interest
relative to other sequence possibilities. Thus, the scanning
apparatus could be set up to work its way along a sequence from
a given contained oligonucleotide to only look at those
positions on the substrate which are expected to have a
positive signal.

It is seen that given a sequence, it can be
deconstructed into n-mers to produce a set of internal contiguous
subsequences. From any given target sequence, we would be able
to determine what fragments would result. The hybridization
sequence method depends, in part, upon being able to work in
the reverse, from a set of fragments of known sequences to the
full sequence. In simple cases, one is able to start at a
single position and work in either or both directions towards
the ends of the sequence as illustrated in the example.
The number of possible sequences of a given length
increases very quickly with the length of that sequence. Thus,
a 10-mer of zeros and ones has 1024 possibilities, a 12-mer has
4096. A 20-mer has over a million possibilities, and a 30-mer
has over a billion. However, a given 30-mer has, at most, 26
different internal 5-mer sequences. Thus, a 30 character
target sequence having over a billion possible sequences can be
substantially defined by only 26 different 5-mers. It will be
recognized that the probe oligonucleotides will preferably, but
need not necessarily, be of identical length, and that the
probe sequences need not necessarily be contiguous in that the
overlapping subsequences need not differ by only a single
subunit. Moreover, each position of the matrix pattern need
not be homogeneous, but may actually contain a plurality of
probes of known sequence. In addition, although all of the
possible subsequence specifications would be preferred, a less
than full set of sequence specifications could be used. In
particular, although a substantial fraction will preferably be
at least about 70%, it may be less than that. About 20% would
be preferred, more preferably at least about 30% would be
desired. Higher percentages would be especially preferred.

2. Example of four letter alphabet 

A four letter alphabet may be conceptualized in at
least two different ways from the two letter alphabet. One
way is to consider the four possible values at each position
and to analogize in a similar fashion to the binary example
each of the overlaps. A second way is to group the binary
digits into groups.

Using the first means, the overlap comparisons are
performed with a four letter alphabet rather than a two letter
alphabet. Then, in contrast to the binary system with 10
positions where 2^10 = 1024 possible sequences, in a 4-character
alphabet with 10 positions, there will actually be 4^10 =
1,048,576 possible sequences. Thus, a four
character sequence has a much larger number of possible
sequences compared to a two character sequence. Note, however,
that there are still only 6 different internal 5-mers. For
simplicity, we shall examine a 5 character string with 3
character subsequences. Instead of only 1 and 0, the
characters may be designated, e.g., A, C, G, and T. Let us
take the sequence GGCTA. The 3-mer subsequences are:

GGC
GCT
CTA

Given these subsequences, there is one sequence, or at most
only a few sequences which would produce that combination of
subsequences, i.e., GGCTA.
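
For a target this short, the claim can be checked exhaustively.
The sketch below enumerates all 4^5 = 1024 five letter sequences
over A, C, G, and T and keeps those whose set of 3-mers is exactly
the observed set. The function names are illustrative assumptions.

```python
from itertools import product


def kmer_set(sequence, k):
    """Set of distinct contiguous k-mers in the sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def sequences_matching(observed, length, k, alphabet="ACGT"):
    """All sequences of the given length whose k-mer set equals observed."""
    observed = set(observed)
    return [
        "".join(s)
        for s in product(alphabet, repeat=length)
        if kmer_set("".join(s), k) == observed
    ]


print(sequences_matching(["GGC", "GCT", "CTA"], length=5, k=3))
```

For this example only GGCTA survives the filter. Brute force of
this kind is practical only for very short targets, since the
number of candidates grows as 4 to the power of the length.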

e.g., a two character probe for the binary system. Each
possible two character probe would be expected to appear 1/4 of
the time in every single two character position. Thus, the
above sequence example would be recognized by each of the 00,
10, 01, and 11. Thus, the sequence information is virtually
lost because the resolution is too low and each recognition
reagent specifically binds at multiple sites on the target
sequence.

The number of different probes which bind to a target
depends on the relationship between the probe length and the
target length. At the extreme of short probe length, the just
mentioned problem exists of excessive redundancy and lack of
resolution. The lack of stability in recognition will also be
a problem with extremely short probes. At the extreme of long
probe length, each entire probe sequence is on a different
position of a substrate. However, a problem arises from the
number of possible sequences, which goes up dramatically with
the length of the sequence. Also, the specificity of
recognition begins to decrease as the contribution to binding
by any particular subunit may become sufficiently low that the
system fails to distinguish the fidelity of recognition.
Mismatched hybridization may be a problem with the
polynucleotide sequencing applications, though the
fingerprinting and mapping applications may not be so strict in
their fidelity requirements. As indicated above, a thirty
position binary sequence has over a billion possible sequences,
a number which starts to become unreasonably large in its
required number of different sequences, even though the target
length is still very short. Preparing a substrate with all
sequence possibilities for a long target may be extremely
difficult due to the many different oligomers which must be
synthesized.

The above example illustrates how a long target
sequence may be reconstructed with a reasonably small number of
shorter subsequences. Since the present day resolution of the
regions of the substrate having defined oligomer probes
attached to the substrate approaches about 10 microns by 10
microns for resolvable regions, about 10^6, or 1 million,
positions can be placed on a one centimeter square substrate.
However, high resolution systems may have particular
disadvantages which may be outweighed using the lower density
substrate matrix pattern. For this reason, a sufficiently
large number of probe sequences can be utilized so that any
given target sequence may be determined by hybridization to a
relatively small number of probes.

A second complication relates to convergence of
sequences to a single subsequence. This will occur when a
particular subsequence is repeated in the target sequence.
This problem can be addressed in at least two different ways.
The first, and simpler way, is to separate the repeat sequences
onto two different targets. Thus, each single target will not
have the repeated sequence and can be analyzed to its end.
This solution, however, complicates the analysis by requiring
that some means for cutting at a site between the repeats can
be located. Typically a careful sequencer would want to have
two intermediate cut points so that the intermediate region can
also be sequenced in both directions across each of the cut
points. This problem is inherent in the hybridization method
for sequencing but can be minimized by using a longer known
probe sequence so that the frequency of probe repeats is
decreased.
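
The effect of probe length on repeats can be illustrated with the
binary example used earlier. In an overlap walk with probes of
length k, each step matches on k-1 symbols, so any (k-1)-symbol
stretch that occurs more than once in the target is a possible
branch point. The sketch below lists such stretches; the function
name repeated_overlaps is an assumption made for illustration.

```python
from collections import Counter


def repeated_overlaps(target, k):
    """Return the (k-1)-symbol stretches occurring more than once.

    Each one is a point where an overlap walk with k-mer probes may
    have more than one way to continue.
    """
    width = k - 1
    counts = Counter(
        target[i:i + width] for i in range(len(target) - width + 1)
    )
    return sorted(o for o, n in counts.items() if n > 1)


target = "1010011100"
print(repeated_overlaps(target, 3))  # short probes: many repeats
print(repeated_overlaps(target, 5))  # longer probes: none
```

With three digit probes every two digit overlap in this target is
repeated, whereas with five digit probes no four digit overlap
recurs, which is why the walk in the earlier example never
branches.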

Knowing the sequence of flanking sequences of the
repeat will simplify the use of polymerase chain reaction (PCR)
or a similar technique to further definitively determine the
sequence between sequence repeats. Probes can be made to
hybridize to those known sequences adjacent the repeat
sequences, thereby producing new target sequences for analysis.
See, e.g., Innis et al. (eds.) (1990) PCR Protocols: A Guide
to Methods and Applications, Academic Press; and methods for
synthesis of oligonucleotide probes, see, e.g., Gait (1984)
Oligonucleotide Synthesis: A Practical Approach, IRL Press,
Oxford.

Other means for dealing with convergence problems
include using particular longer probes, and using degeneracy
reducing analogues, see, e.g., Macevicz, S. (1990) PCT
publication number WO 90/04652, which is hereby incorporated
herein by reference.

III. POLYNUCLEOTIDE SEQUENCING

In principle, the making of a substrate having a
positionally defined matrix pattern of all possible
oligonucleotides of a given length involves a conceptually
simple method of synthesizing each and every different possible
oligonucleotide, and affixing it to a definable position.
Oligonucleotide synthesis is presently mechanized and enabled
by current technology, see, e.g., U.S. S.N. 07/362,901 (VLSIPS
parent); U.S. S.N. 07/492,462 (VLSIPS CIP); and instruments
supplied by Applied Biosystems, Foster City, California.

A. Preparation of Substrate Matrix 

The collection of specific
oligonucleotides used in polynucleotide sequencing may be
produced in at least two different ways. Present technology
certainly allows production of ten nucleotide oligomers on a
solid phase or other synthesizing system. See, e.g.,
instrumentation provided by Applied Biosystems, Foster City,
California. Although a single oligonucleotide can be
relatively easily made, a large collection of them would
typically require a fairly large amount of time and investment.
For example, there are 4^10 = 1,048,576 possible ten nucleotide
oligomers. Present technology allows making each and every one
of them in a separate purified form though such might be costly
and laborious.

Once the desired repertoire of possible oligomer
sequences of a given length has been synthesized, this
collection of reagents may be individually positionally
attached to a substrate, thereby allowing a batchwise
hybridization step. Present technology also would allow the
possibility of attaching each and every one of these 10-mers to
a separate specific position on a solid matrix. This
attachment could be automated in any of a number of ways,
particularly use of a caged biotin type linking. This would
produce a matrix having each of the different possible 10-mers.

A batchwise hybridization is much preferred because
of its reproducibility and simplicity. An automated process of
attaching various reagents to positionally defined sites on a
substrate is provided in PCT publication no. WO90/15070;
U.S. S.N. 07/624,120; and PCT publication no. WO91/07087; each
of which is hereby incorporated herein by reference.
Instead of separate synthesis of each
oligonucleotide, these oligonucleotides are conveniently
synthesized in parallel by sequential synthetic processes on a
defined matrix pattern as provided in PCT publication no.
WO90/15070; and U.S. S.N. 07/624,120, which are incorporated
herein by reference. Here, the oligonucleotides are
synthesized stepwise on a substrate at positionally separate
and defined positions. Use of photosensitive blocking reagents
allows for defined sequences of synthetic steps over the
surface of a matrix pattern. By use of the binary masking
strategy, the surface of the substrate can be positioned to
generate a desired pattern of regions, each having a defined
sequence oligonucleotide synthesized and immobilized thereto.

Although the prior art technology can be used to
generate the desired repertoire of oligonucleotide probes, an
efficient and cost effective means would be to use the VLSIPS
technology described in PCT publication no. WO90/15070 and
U.S. S.N. 07/624,120. In this embodiment, the photosensitive
reagents involved in the production of such a matrix are
described below.

The regions for synthesis may be very small, usually
less than about 100 µm x 100 µm, more usually less than about
50 µm x 50 µm. The photolithography technology allows
synthetic regions of less than about 10 µm x 10 µm, about
3 µm x 3 µm, or less. The detection also may detect such sized
regions, though larger areas are more easily and reliably
measured.

At a size of about 30 microns by 30 microns, one
million regions would take about 11 centimeters square or a
single wafer of about 4 centimeters by 4 centimeters. Thus the
present technology provides for making a single matrix of that
size having all one million plus possible oligonucleotides.
Region sizes are sufficiently small to correspond to densities
of at least about 5 regions/cm^2, 20 regions/cm^2, 50 regions/cm^2,
100 regions/cm^2, and greater, including 300 regions/cm^2, 1000
regions/cm^2, 3K regions/cm^2, 10K regions/cm^2, 30K regions/cm^2,
100K regions/cm^2, 300K regions/cm^2 or more, even in excess of
one million regions/cm^2.
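
The relationship between region size, density, and substrate area
is simple arithmetic and can be sketched as follows. The sketch
assumes square regions tiled edge to edge with no gaps or margins,
which is an idealization; the function names are illustrative.

```python
MICRONS_PER_CM = 10_000


def regions_per_cm2(side_um):
    """Square regions of the given side that fit in one square cm."""
    return (MICRONS_PER_CM / side_um) ** 2


def area_cm2(n_regions, side_um):
    """Substrate area needed for n gap-free square regions."""
    return n_regions * (side_um / MICRONS_PER_CM) ** 2


print(regions_per_cm2(10))       # density at 10 micron regions
print(area_cm2(1_000_000, 30))   # area for one million 30 micron regions
print(area_cm2(4 ** 10, 30))     # area for every possible 10-mer
```

Under this idealization, 10 micron regions give one million
regions per square centimeter, and one million 30 micron regions
occupy roughly 9 square centimeters, the same order as the
approximate figures quoted above.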
Although the pattern of the regions which contain
specific sequences is theoretically not important, for
practical reasons certain patterns will be preferred in
synthesizing the oligonucleotides. The application of binary
masking algorithms for generating the pattern of known
oligonucleotide probes is described in related U.S. S.N.
07/624,120. By use of these binary masks, a highly efficient
means is provided for producing the substrate with the desired
matrix pattern of different sequences. Although the binary
masking strategy allows for the synthesis of all lengths of
polymers, the strategy may be easily modified to provide only
polymers of a given length. This is achieved by omitting steps
where a subunit is not attached.

The strategy for generating a specific pattern may
take any of a number of different approaches. These approaches
are well described in related application U.S. S.N. 07/624,120
and include a number of binary masking approaches which will
not be exhaustively discussed herein. However, the binary
masking and binary synthesis approaches provide a maximum of
diversity with a minimum number of actual synthetic steps.

The length of oligonucleotides used in sequencing
applications will be selected on criteria determined to some
extent by the practical limits discussed above. For example,
if probes are made as oligonucleotides, there will be 65,536
possible eight nucleotide sequences. If a nine subunit
oligonucleotide is selected, there are 262,144 possible
permutations of sequences. If a ten-mer oligonucleotide is
selected, there are 1,048,576 possible permutations of
sequences. As the number gets larger, the required number of
positionally defined subunits necessary to saturate the
possibilities also increases. With respect to hybridization
conditions, the length of the matching necessary to confer
stability of the conditions selected can be compensated for.
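
The probe counts quoted above follow from raising the alphabet
size to the probe length. A minimal sketch, with the function name
probe_count chosen only for illustration:

```python
def probe_count(length, alphabet_size=4):
    """Number of distinct probes of the given length."""
    return alphabet_size ** length


for n in range(8, 12):
    # Each added nucleotide multiplies the count by four.
    print(f"{n}-mers: {probe_count(n):,}")
```

The same function with alphabet_size=2 reproduces the binary
figures used in the theoretical analysis, e.g., 1,024 sequences
of ten digits.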

Then, the target may be bound to the whole collection
of beads, and those beads that have appropriate specific
reagents on them will bind to target. Then a sorting system
may be utilized to sort those beads that actually bind the
target from those that do not. This may be accomplished by
presently available cell sorting devices or a similar
apparatus. After the relatively small number of beads which
have bound the target have been collected, the encoding scheme
may be read off to determine the specificity of the reagent on
the bead. An encoding system may include a magnetic system, a
shape encoding system, a color encoding system, or a
combination of any of these, or any other encoding system.
Once again, with the collection of specific interactions that
have occurred, the binding may be analyzed for sequence
information, fingerprint information, or mapping information.

The parameters of polynucleotide sizes of both the
probes and target sequences are determined by the applications
and other circumstances. The length of the oligonucleotide
probes used will depend in part upon the limitations of the
VLSIPS technology to provide the number of desired probes. For
example, in an absolute sequencing application, it is often
useful to have virtually all of the possible oligonucleotides
of a given length. As indicated above, there are 65,536
8-mers, 262,144 9-mers, 1,048,576 10-mers, 4,194,304 11-mers,
etc. As the length of the oligomer increases the number of
different probes which must be synthesized also increases at a
rate of a factor of 4 for every additional nucleotide.
Eventually the size of the matrix and the limitations in the
resolution of regions in the matrix will reach the point where
an increase in number of probes becomes disadvantageous.
However, this sequencing procedure requires that the system be
able to distinguish, by appropriate selection of hybridization
and washing conditions, between binding of absolute fidelity
and binding of complementary sequences containing mismatches.
On the other hand, if the fidelity is unnecessary, this
discrimination is also unnecessary and a significantly longer
probe may be used. Significantly longer probes would typically
be useful in fingerprinting or mapping applications.

performed by methods applicable to them as recognized by a
person having ordinary skill in manipulating the corresponding
polymer.

In some embodiments, the target need not actually be
labeled if a means for detecting where interaction takes place
is available. As described below, for a nucleic acid
embodiment, such may be provided by an intercalating dye which
intercalates only into double stranded segments, e.g., where
interaction occurs. See, e.g., Sheldon et al. U.S. Pat. No.
4,582,789.

In many uses, the target sequence will be absolutely 
homogeneous, both with respect to the total sequence and with 
respect to the ends of each molecule. Homogeneity with respect 
to sequence is important to avoid ambiguity. It is preferable
that the target sequences of interest not be contaminated with
a significant amount of labeled contaminating sequences. The
extent of allowable contamination will depend on the
sensitivity of the detection system and the inherent signal to
noise of the system. Homogeneous contamination sequences will
be particularly disruptive of the sequencing procedure.

However, although the target polynucleotide must have
a unique sequence, the target molecules need not have identical
ends. In fact, the homogeneous target molecule preparation may
be randomly sheared to increase the number of
molecules. Since the total information content remains the
same, the shearing results only in a higher number of distinct
sequences which may be labeled and bind to the probe. This
fragmentation may give a vastly superior signal relative to a
preparation of the target molecules having homogeneous ends.
The signal for the hybridization is likely to be dependent on
the numerical frequency of the target-probe interactions. If a
sequence is individually found on a larger number of separate
molecules a better signal will result. In fact, shearing a
homogeneous preparation of the target may often be preferred
before the labeling procedure is performed, thereby producing a
large number of labeling groups associated with each
subsequence.

C. Hybridization Conditions

The hybridization conditions between probe and target
should be selected such that the specific recognition
interaction, i.e., hybridization, of the two molecules is both
sufficiently specific and sufficiently stable. See, e.g.,
Hames and Higgins (1985) Nucleic Acid Hybridisation: A
Practical Approach, IRL Press, Oxford. These conditions will
be dependent both on the specific sequence and often on the
guanine and cytosine (GC) content of the complementary hybrid
strands. The conditions may often be selected to be
universally equally stable independent of the specific
sequences involved. This typically will make use of a reagent
such as an alkylammonium buffer. See, Wood et al. (1985) "Base
Composition-independent Hybridization in Tetramethylammonium
Chloride: A Method for Oligonucleotide Screening of Highly
Complex Gene Libraries," Proc. Natl. Acad. Sci. USA,
82:1585-1588; and Khrapko et al. (1989) "An Oligonucleotide
Hybridization Approach to DNA Sequencing," FEBS Letters,
256:118-122; each of which is hereby incorporated herein by
reference. An alkylammonium buffer tends to minimize
differences in hybridization rate and stability due to GC
content. By virtue of the fact that sequences then hybridize
with approximately equal affinity and stability, there is
relatively little bias in strength or kinetics of binding for
particular sequences. Temperature and salt conditions along
with other buffer parameters should be selected such that the
kinetics of renaturation are essentially independent of
the specific target subsequence or oligonucleotide probe
involved. In order to ensure this, the hybridization reactions
will usually be performed in a single incubation of all the
substrate matrices together exposed to the identical
target probe solution under the same conditions.

Alternatively, various substrates may be individually 
treated differently. Different substrates may be produced,
each having reagents which bind to target subsequences with
substantially identical stabilities and kinetics of
hybridization. For example, all of the high GC content probes
could be synthesized on a single substrate which is treated

in the target sequence, this data may be aligned by overlap to
reconstruct the entire sequence of the target, as illustrated
above.

It is also possible to dispense with actual labeling
if some means for detecting the positions of interaction
between the sequence specific reagent and the target molecule
is available. This may take the form of an additional reagent
which can indicate the sites either of interaction, or the
sites of lack of interaction, e.g., a negative label. For the
nucleic acid embodiments, locations of double strand
interaction may be detected by the incorporation of
intercalating dyes, or other reagents such as antibody or other
reagents that recognize helix formation, see, e.g., Sheldon, et
al. (1986) U.S. Pat. No. 4,582,789, which is hereby
incorporated herein by reference.

E. Analysis 

Although the reconstruction can be performed manually
as illustrated above, a computer program will typically be used
to perform the overlap analysis. A program may be written and
run on any of a large number of different computer hardware
systems. The variety of operating systems and languages
useable will be recognized by a computer software engineer.
Various different languages may be used, e.g., BASIC; C;
PASCAL; etc. A simple flow chart of data analysis is
illustrated in Figure 4.

F. Substrate Reuse 

Finally, after a particular sequence has been
hybridized and the pattern of hybridization analyzed, the
matrix substrate should be reusable and readily prepared for
exposure to a second or subsequent target polynucleotide. In
order to do so, the hybrid duplexes are disrupted and the
matrix treated in a way which removes all traces of the
original target. The matrix may be treated with various
detergents or solvents to which the substrate, the
oligonucleotide probes, and the linkages to the substrate are
inert. This treatment may include an elevated temperature
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synthesizing procedures such as, e.g. , polymerase chain 
reaction. 

In one embodiment, the individually isolated probes
may be attached to the matrix at defined positions. These
probe reagents may be attached by an automated process making
use of the caged biotin methodology described in U.S. S.N.
07/612,671 (caged biotin CIP), or using photochemical reagents,
see, e.g., Dattagupta et al. (1985) U.S. Pat. No. 4,542,102 and
(1987) U.S. Pat. No. 4,713,326. Each individual purified
reagent can be attached individually at specific locations on a
substrate.

In another embodiment, the VLSIPS synthesizing
technique may be used to synthesize the desired probes at
specific positions on a substrate. The probes may be
synthesized by successively adding appropriate monomer
subunits, e.g., nucleotides, to generate the desired sequences.

In another embodiment, a relatively short specific
oligonucleotide is used which serves as a targeting reagent for
positionally directing the sequence recognition reagent. For
example, the sequence specific reagents having a separate
additional sequence recognition segment (usually of a different
polymer from the target sequence) can be directed to target
oligonucleotides attached to the substrate. By use of
non-natural targeting reagents, e.g., unusual nucleotide analogues
which pair with other unnatural nucleotide analogues and which
do not interfere with natural nucleotide interactions, the
natural and non-natural portions can coexist on the same
molecule without interfering with their individual
functionalities. This can combine both a synthetic and a
biological production system, analogous to the technique for
targeting monoclonal antibodies to locations on a VLSIPS
substrate at defined positions. Unnatural optical isomers of
nucleotides may be useful unnatural reagents subject to similar
chemistry, but incapable of interfering with the natural
biological polymers. See also, U.S. S.N. 07/626,730, filed
December 6, 1990, which is hereby incorporated herein by
reference.
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information may be arrived at with less stringent hybridization 
conditions while providing valuable fingerprinting information. 
However, since the entire substrate is typically exposed to the
target molecule at one time, the binding affinities of the probes
should usually be of approximately comparable levels. For this
reason, if oligonucleotide probes are being used, their lengths
should be approximately comparable and will be selected to
hybridize under conditions which are common for most of the
probes on the substrate. Much as in a Southern hybridization,
the target and oligonucleotide probes are of lengths typically
greater than about 25 nucleotides. Under appropriate
hybridization conditions, e.g., typically higher salt and lower
temperature, the probes will hybridize irrespective of
imperfect complementarity. In fact, with probes of greater
than, e.g., about fifty nucleotides, the difference in
stability of different sized probes will be relatively minor.

Typically the fingerprinting is merely for probing
similarity or homology. Thus, the stringency of hybridization
can usually be decreased to fairly low levels. See, e.g.,
Wetmur and Davidson (1968) "Kinetics of Renaturation of DNA,"
J. Mol. Biol., 31:349-370; and Kanehisa, M. (1984) Nuc. Acids
Res., 12:203-213.

E. Detection: VLSIPS Scanning 
Detection methods will be selected which are
appropriate for the selected label. The scanned data need
not necessarily be digitized or placed into a specific digital
database, though such would most likely be done. For example,
the analysis in fingerprinting could be photographic. Where a
standardized fingerprint substrate matrix is used, the pattern
of hybridizations may be spatially unique and may be compared
photographically. In this manner, each sample may have a
characteristic pattern of interactions and the likelihood of
identical patterns will preferably be of such low frequency that
the fingerprint pattern indeed becomes a characteristic pattern
virtually as unique as an individual's fingertip fingerprint.
With a standardized substrate, every individual could be, in
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tested for their expression of particular mRNA sequences, or 
for patterns of expressed mRNA species. This may be applicable
to a cell or tissue type, to the messenger RNA
population expressed by a cell, or to the genetic content of a
cell.

RNA can be isolated from a cell or a cell population,
such as a purified cell fraction or a biopsy sample. The RNA
may be labeled, for example by attaching a fluorescent molecule
to isolated RNA or by using radiolabeled RNA (e.g., end-labeled
with T4 polynucleotide kinase). A VLSIPS substrate containing
positionally discrete oligonucleotide sequences may then be
exposed to the pool of labeled RNA species under conditions
permitting specific hybridization. The pattern of positions at
which labeled RNA has formed specific hybrids may be compared
to a reference pattern to identify, and in some embodiments
quantify, the expressed RNA species, or to identify the
hybridization pattern itself as being characteristic of a
particular cell type.

For example but not for limitation, a VLSIPS
oligonucleotide substrate may be hybridized to a labeled RNA
sample obtained from a first cell type (e.g., human
lymphocytes) to establish a reference hybridization pattern for
the first cell type. Similarly, an identical VLSIPS
oligonucleotide substrate may be hybridized to a labeled RNA
sample obtained from a second cell type (e.g., human monocytes)
to establish a reference hybridization pattern for the second
cell type. Labeled RNA may then be prepared from a cell or a
cell population and hybridized to an identical VLSIPS
oligonucleotide substrate, and the resultant hybridization
pattern can be compared to the reference hybridization patterns
established for the first and second cell types. By such
comparisons, the RNA expression pattern of a cell or cell
population can be identified as being similar to or distinct
from one or more reference hybridization patterns.
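
The comparison of a sample pattern against reference patterns may be illustrated with the following Python sketch. It assumes, for illustration only, that each pattern has been reduced to a list of signal intensities in a fixed locus order and that Pearson correlation is an acceptable similarity measure; the names `pearson` and `closest_reference` and the intensity values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length intensity vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a flat pattern carries no shape information
    return cov / (sd_x * sd_y)


def closest_reference(sample, references):
    """Return the best-matching reference name and all similarity scores."""
    scores = {name: pearson(sample, pattern) for name, pattern in references.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical intensities at five loci for two reference cell types.
references = {
    "lymphocyte": [9.0, 1.0, 8.0, 0.0, 2.0],
    "monocyte": [1.0, 9.0, 0.0, 8.0, 7.0],
}
best, scores = closest_reference([8.0, 2.0, 9.0, 1.0, 1.0], references)
```

A low score against every reference would suggest that the sample is distinct from all of the reference patterns rather than similar to the nearest one.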

Where a positionally discrete oligonucleotide on the
VLSIPS substrate is in molar excess over the amount of the
cognate (complementary) labeled RNA species in the
hybridization reaction, the amount of specific hybridization to
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that VLSIPS locus (as measured by labeling intensity at that 
locus) can provide a quantitative measurement of the cognate 
RNA species present in the labeled RNA sample. Thus,
hybridization of labeled RNA to a VLSIPS oligonucleotide
substrate can provide information identifying the individual
RNA species that are expressed in a particular cell or cell
population, as well as the relative abundance of one or more
individual RNA species. This information can serve to
fingerprint specific cell types or particular stages in cell
differentiation.
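
As a minimal sketch of how relative abundance might be derived from labeling intensities, the following Python fragment assumes that signal is proportional to the amount of cognate RNA (the probe being in molar excess, as stated above) and that a single uniform background value may be subtracted. The background subtraction, the function name `relative_abundance`, and the example values are assumptions of the sketch.

```python
def relative_abundance(intensities, background=0.0):
    """Convert per-locus labeling intensities to fractions of total signal.

    `intensities` maps a locus name to its measured intensity; values at
    or below `background` are treated as zero.
    """
    corrected = {locus: max(value - background, 0.0)
                 for locus, value in intensities.items()}
    total = sum(corrected.values())
    if total == 0:
        return {locus: 0.0 for locus in corrected}
    return {locus: value / total for locus, value in corrected.items()}


# Hypothetical readings at three loci with a uniform background of 10 units.
fractions = relative_abundance({"locus_a": 110.0, "locus_b": 40.0, "locus_c": 10.0},
                               background=10.0)
```

The fractions are relative to the loci measured on the substrate, not to the total RNA content of the cell.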

For example but not for limitation, RNA samples
prepared from tissue biopsies, specifically tumor biopsies, can
be labeled and hybridized to a VLSIPS oligonucleotide
substrate, and the resultant hybridization pattern can provide
information regarding cell type, degree of differentiation, and
metastatic potential (malignancy). Some of the positionally
distinct oligonucleotides may hybridize specifically with RNA
species transcribed from endogenous proto-oncogenes (e.g.,
c-myc, c-ras, c-sis, etc.) which are, in certain instances,
transcribed at elevated levels in neoplastic tissues.

In addition to diagnostic applications, labeled RNA
samples from various neoplastic cell types may be hybridized to
VLSIPS oligonucleotide substrates and the resultant
hybridization pattern(s) compared to reference patterns
obtained with RNA from related, non-neoplastic cell types.
Identification of distinctions between the hybridization
patterns obtained with RNA from neoplastic cells as compared to
patterns obtained with RNA from non-neoplastic cells may be of
diagnostic value and may identify RNA species that encode
proteins that are potential targets for novel therapeutic
modalities. In fact, the high resolution of the test will
allow more complete characterization of parameters which define
particular diseases. Thus, the power of diagnostic tests may
be limited by the extent of statistical correlation with a
particular condition rather than by the number of RNA species
which are tested. The present invention provides the means to
generate this large universe of possible reagents and the
ability to actually accumulate that correlative data.
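
One hedged way to identify distinctions between a neoplastic pattern and a non-neoplastic reference is sketched below in Python. The two-fold threshold, the intensity floor, and the name `differential_loci` are illustrative assumptions, not parameters taken from this description, and the intensity values are invented.

```python
import math


def differential_loci(tumor, normal, min_fold=2.0, floor=1.0):
    """Return loci whose signal differs by at least `min_fold` either way.

    `tumor` and `normal` map locus names to intensities. Intensities are
    clamped to `floor` so that near-zero readings do not yield unbounded
    ratios. The value reported for each locus is log2(tumor / normal).
    """
    hits = {}
    for locus in tumor.keys() & normal.keys():
        ratio = max(tumor[locus], floor) / max(normal[locus], floor)
        if ratio >= min_fold or ratio <= 1.0 / min_fold:
            hits[locus] = math.log2(ratio)
    return hits


# Hypothetical intensities at three loci.
tumor = {"c-myc": 80.0, "actin": 50.0, "locus_x": 5.0}
normal = {"c-myc": 10.0, "actin": 48.0, "locus_x": 40.0}
changed = differential_loci(tumor, normal)
```

Loci flagged in this way would be candidates for further study and would not by themselves establish diagnostic value.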
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For fingerprinting of RNA expression patterns, the 
VLSIPS substrate polynucleotides will be at least 12 
nucleotides in length, preferably at least 15 nucleotides in 
length, more preferably at least 25 nucleotides in length. The 
sequences of the positionally distinct polynucleotides on the
VLSIPS substrate may be selected from published sources of
sequence data, including but not limited to computerized
databases such as GenBank, and may or may not include random or
pseudorandom sequences for detecting RNA species which have not
yet been identified in the art. Fingerprint analysis of RNA
expression patterns will typically employ high-stringency
washes so as to provide hybridization patterns that reflect
predominantly specific hybridization. However, some
non-specific hybridization and/or cross-hybridization to slightly
mismatched sequences may be tolerated, and in some embodiments
may be desirable.

The ability to generate a high density means for
screening the presence or absence of specific interactions
allows for the possibility of screening for, if not saturating,
all of a very large number of possible interactions. This is
very powerful in providing the means for testing the
combinations of molecular properties which can define a class
of samples. For example, a species of organism may be
characterized by its DNA sequences, e.g., a genetic
fingerprint. By using a fingerprinting method, it may be
determined that all members of that species are sufficiently
similar in specific sequences that they can be easily
identified as being within a particular group. Thus, newly
defined classes may be resolved by their similarity in
fingerprint patterns. Conversely, a non-member of that
group will fail to share many of those identifying
characteristics. Moreover, since the technology allows testing
of a very large number of specific interactions, it also
provides the ability to more finely distinguish between closely
related different cells or samples. This will have important
applications in diagnosing viral, bacterial, and other
pathological or nonpathological infections.



WO 92/10588 PCT/US91/09226

In particular, cell classification may be defined by
any of a number of different properties. For example, a cell
class may be defined by the DNA sequences contained therein.
This allows species identification for parasitic or other
infections. For example, the human cell is presumably
genetically distinguishable from a monkey cell, but different
human cells will share many genetic markers. At higher
resolution, each individual human genome will exhibit unique
sequences that can define it as a single individual.
Likewise, a developmental stage of a cell type may be
definable by its pattern of expression of messenger RNA. For
example, in particular stages of cells, high levels of
ribosomal RNA are found whereas relatively low levels of other
types of RNA, such as messenger RNAs, may be found. The high
resolution distinguishability provided by this fingerprinting method
allows the distinction between cells which have relatively
minor differences in their expressed mRNA populations. Where a
pattern is shown to be characteristic of a stage, a stage may
be defined by that particular pattern of messenger RNA
expression.

In another embodiment, a substrate as provided herein
may be used for genetic screening. This would allow for
simultaneous screening of thousands of genetic markers. As the
density of the matrix is increased, many more molecules can be
simultaneously tested. Genetic screening then becomes a
simpler method as the present invention provides the ability to
screen for thousands, tens of thousands, and hundreds of
thousands, even millions of different possible genetic
features. At present, however, the genetic markers known to
correlate highly with conditions number only in the hundreds. Again,
the possibility for screening a large number of sequences
provides the opportunity for generating the data which can
provide correlation between sequences and specific conditions
or susceptibility. The present invention provides the means to
generate extremely valuable correlations useful for the genetic
detection of the causative mutation leading to medical
conditions. In still another embodiment, the present invention
would be applicable to distinguishing two individuals having
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identical genetic compositions. The antibody population within 
an individual is dependent both on genetic and historical
factors. Each individual experiences a unique exposure to
various infectious agents, and the combined antibody expression
is partly determined thereby. Thus, individuals may also be
fingerprinted by their lymphocyte DNA or RNA hybridization
pattern(s). Similar sorts of immunological and environmental
histories may be useful for fingerprinting, perhaps in
combination with other screening properties.

With the definition of new classes of cells, a cell
sorter will be used to purify them. Moreover, new markers for
defining that class of cells will be identified. For example,
where the class is defined by its RNA content, cells may be
screened by antisense probes which detect the presence or
absence of specific sequences therein. Alternatively, cell
lysates may provide information useful in correlating
intracellular properties with extracellular markers which
indicate functional differences. Using standard cell sorter
technology with a fluorescent or otherwise labeled antisense probe
which recognizes the internal presence of the specific sequences of
interest, the cell sorter will be able to isolate a relatively
homogeneous population of cells possessing the particular
marker. Using successive probes, the sorting process should be
able to select for cells having a combination of a large number
of different markers.

One problem with the fingerprinting method as an identification
means arises from mosaicism in an organism. A mosaic
organism is one whose genetic content in different cells is
significantly different. Various clonal populations should
have similar genetic fingerprints, though different clonal
populations may have different genetic contents. See, for
example, Suzuki et al. An Introduction to Genetic Analysis (4th
Ed.), Freeman and Co., New York, which is hereby incorporated
herein by reference. However, this should be a
relatively rare problem and could be more carefully evaluated
with greater experience using the fingerprinting methods.

The invention will also find use in detecting
changes, both genetic and in protein expression (i.e., by RNA
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expression fingerprinting) , in a rapidly "evolving" protozoan 
infection, or similarly changing organism. 

V. MAPPING

A. General

The use of the present invention for mapping
parallels its use for fingerprinting and sequencing. Mapping
provides the ability to locate particular segments along the
length of the polynucleotide. The mapping provides the ability
to determine, in a relative sense, the order of various
subsequences. This may be achieved using at least two
different approaches.

The first approach is to take the large sequence and
fragment it at specific points. The fragments are then ordered
and attached to a solid substrate. For example, the clones
resulting from a chromosome walking process may be individually
attached to the substrate by methods, e.g., caged biotin
techniques, indicated earlier. Segments of unknown map
position will be exposed to the substrate and will hybridize to
the segment which contains that particular sequence. This
procedure allows the rapid mapping of a number of
different labeled segments, each mapping requiring only a
single hybridization step once the substrate is generated. The
substrate may be regenerated by removal of the interacting
segment, and the next mapping segment applied.

In an alternative method, a plurality of subsequences
can be attached to a substrate. Various short probes may be
applied to determine which segments may contain particular
overlaps. The theoretical basis and a description of this
mapping procedure are contained in, e.g., Evans et al. (1989)
"Physical Mapping of Complex Genomes by Cosmid Multiplex
Analysis," Proc. Natl. Acad. Sci. USA 86:5030-5034, and other
references cited above in the Section labeled "Overall
Description." Using this approach, the details of the mapping
embodiment are very similar to those used in the fingerprinting
embodiment.
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B. Preparation of Substrate Matrix 
The substrate may be generated by either of the
methods generally applicable in the sequencing and
fingerprinting embodiments. The substrate may be made either
synthetically, or by attaching otherwise purified probes or
sequences to the matrix. The probes or sequences may be
derived either from synthetic or biological means. As
indicated above, the solid phase substrate synthetic methods
may be utilized to generate a matrix with positionally defined
sequences. In the mapping embodiment,
saturation of all possible subsequences of a preselected length
is far less important than in the sequencing embodiment, but
the probes used may need to be much longer.
The processes for making a substrate which has longer
oligonucleotide probes should not be significantly different
from those described for the sequencing embodiments, but the
optimization parameters may be modified to comply with the
mapping needs.

C. Labeling

The labeling methods will be similar to those 
applicable in sequencing and fingerprinting embodiments. 
Again, the target sequences may be desired to be fragmented. 

D. Hybridization/Specific Interaction

The specificity of interaction between the targets
and probes would typically be closer to that used for
fingerprinting embodiments, where homology is more important
than the absolute distinguishability of high fidelity complementary
hybridization. Usually, the hybridization conditions will be
such that merely homologous segments will interact and provide
a positive signal. Much as in the fingerprinting embodiment, it
may be useful to measure the extent of homology by successive
incubations at higher stringency conditions. Alternatively, a
plurality of different probes, each having a different level of
homology, may be used. In either way, the spectrum of homologies
can be measured.
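
The measurement of homology by successive incubations at higher stringency might be recorded as in the following Python sketch. It assumes that a signal is read at every position after each wash and that a single threshold separates retained from lost signal; the wash labels, the threshold, and the function name `highest_stringency_retained` are hypothetical.

```python
def highest_stringency_retained(washes, threshold):
    """Report, per probe, the most stringent wash at which signal persisted.

    `washes` is a list of (label, {probe: signal}) pairs ordered from the
    lowest to the highest stringency. A probe whose signal falls below
    `threshold` at the first wash is reported as None.
    """
    probes = set()
    for _, signals in washes:
        probes.update(signals)
    result = {}
    for probe in probes:
        last_label = None
        for label, signals in washes:
            if signals.get(probe, 0.0) < threshold:
                break  # duplex did not survive this wash
            last_label = label
        result[probe] = last_label
    return result


# Hypothetical signals for three probes after three successive washes.
washes = [
    ("low", {"p1": 90.0, "p2": 80.0, "p3": 5.0}),
    ("medium", {"p1": 85.0, "p2": 20.0, "p3": 2.0}),
    ("high", {"p1": 70.0, "p2": 3.0, "p3": 1.0}),
]
retained = highest_stringency_retained(washes, threshold=50.0)
```

Probes retaining signal only at low stringency would be read as merely homologous, and probes retaining signal at high stringency as closely matched.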




E. Detection

The detection methods used in the mapping procedure
will be virtually identical to those used in the fingerprinting
embodiment. The detection methods will be selected in
combination with the labeling methods.

F. Analysis 

The analysis of the data in a mapping embodiment will
typically be somewhat different from that in fingerprinting.
The fingerprinting embodiment will test for the presence or
absence of specific or homologous segments. However, in the
mapping embodiment, the existence of an interaction is coupled
with some indication of the location of the interaction. The
interaction is mapped in some manner to the physical polymer
sequence. Some means for determining the relative positions of
different probes is provided. This may be achieved by
synthesis of the substrate in a pattern, or may result from
analysis of sequences after they have been attached to the
substrate.

For example, the probes may be randomly positioned at
various locations on the substrate. However, the relative
positions of the various reagents in the original polymer may
be determined by using short fragments, e.g., individually, as
target molecules which determine the proximity of different
probes. By an automated system of testing each different short
fragment of the original polymer, coupled with proper analysis,
it will be possible to determine which probes are adjacent to one
another on the original target sequence and correlate that with
positions on the matrix. In this way, the matrix is useful for
determining the relative locations of various new segments in
the original target molecule. This sort of analysis is
described in Evans, and the related references described above.
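
The proximity analysis just described may be sketched in Python by counting how often two probes hybridize to the same short fragment. This is an illustrative simplification, not the procedure of Evans et al.; the function name `probe_proximity` and the fragment and probe labels are assumptions of the sketch.

```python
from collections import Counter
from itertools import combinations


def probe_proximity(fragment_hits):
    """Count, for every pair of probes, the short fragments that hit both.

    `fragment_hits` maps each short fragment of the original polymer to
    the set of probes it hybridized to. Probes sharing many fragments
    are taken to lie near one another on the original target.
    """
    pairs = Counter()
    for probes in fragment_hits.values():
        for a, b in combinations(sorted(probes), 2):
            pairs[(a, b)] += 1
    return pairs


# Hypothetical results for three short fragments tested individually.
hits = {
    "fragment_1": {"P1", "P2"},
    "fragment_2": {"P2", "P3"},
    "fragment_3": {"P1", "P2"},
}
proximity = probe_proximity(hits)
```

In this example P1 and P3 never share a fragment, which is consistent with the order P1, P2, P3 along the target, although the counts alone do not prove that order.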

In another form of mapping, as described above in the
fingerprinting section, the developmental map of a cell or
biological system may be measured using fingerprinting type
technology. Thus, the mapping may be along a temporal
dimension rather than along a polymer dimension. The mapping
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directly attaching each' reagent at each desired position, the 
reagent may be attached to a specific desired complementary
oligonucleotide which will ultimately be specifically directed
toward locations on the matrix having a complementary
oligonucleotide attached thereat.

In addition, the technology allows screening for
specificity of interaction with particular reagents. For
example, the oligonucleotide sequence specificity of binding of
a potential reagent may be tested by presenting to the reagent
all of the possible subsequences available for binding.
Although secondary or higher order sequence specific features
might not be easily screenable using this technology, it does
provide a convenient, simple, quick, and thorough screen of
interactions between a reagent and its target recognition
sequences. See, e.g., Pfeifer et al. (1989) Science 246:810-812.

For example, the interaction of a promoter protein
with its target binding sequence may be tested for many
different, or all, possible binding sequences. By testing the
strength of interactions under various different conditions,
the interaction of the promoter protein with each of the
different potential binding sites may be analyzed. The
spectrum of strength of interactions with each different
potential binding site may provide significant insight into the
types of features which are important in determining
specificity.
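
To give a sense of scale, all possible binding sequences of a given length can be enumerated as in the Python sketch below; there are 4^k sequences of length k over the four natural bases. The function names and the binding strengths in the example are hypothetical and serve only to show how a spectrum of interaction strengths might be ranked.

```python
from itertools import product


def all_subsequences(k, alphabet="ACGT"):
    """Enumerate every sequence of length k over the given alphabet."""
    return ["".join(bases) for bases in product(alphabet, repeat=k)]


def rank_sites(strengths):
    """Order candidate binding sites from strongest to weakest signal."""
    return sorted(strengths, key=strengths.get, reverse=True)


trimers = all_subsequences(3)          # 4**3 = 64 candidate sites
ranked = rank_sites({"TATA": 0.9, "TATT": 0.4, "GCGC": 0.05})
```

For octamers the same enumeration yields 4**8 = 65,536 sequences, which indicates the number of positions a saturating matrix would need.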

An additional example of a sequence specific
interaction between reagents is the testing of binding of a
double stranded nucleic acid structure with a single stranded
oligonucleotide. Often, a triple stranded structure is
produced which has significant aspects of sequence specificity.
Testing of such interactions with either sequences comprising
only natural nucleotides, or perhaps with nucleotide
analogs, may be very important in screening for particularly
useful diagnostic or therapeutic reagents. See, e.g., Haner
and Dervan (1990) Biochemistry 29:9761-9765, and references
therein.
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available by this technology, purified classes of cells having 
identifiable differences in RNA expression and/or DNA structure 
are made available.

In an alternative embodiment, subclasses of T-cells
are defined, in part, upon the combination of expressed cell
surface RNA species. The present invention allows for the
simultaneous screening of a large plurality of different RNA
species together. Thus, higher resolution classification of
different T-cell subclasses becomes possible and, with the
definitions and functional differences which correlate with
those other parameters, the ability to purify those cell types
becomes available. This applicability is not limited to T-cells,
to lymphocyte cells, or even to freely circulating cells. Many of
the cells for which this would be most useful will be immobile
cells found in particular tissues or organs. Tumor cells will
be diagnosed or detected using these fingerprinting techniques.
Coupled with a temporal change in structure, developmental
classes may also be selected and defined using these
technologies. The present invention provides not only the ability
to define new classes of cells based upon functional
or structural differences, but also the ability to
select or purify populations of cells which share these
particular properties. In particular, antisense DNA or RNA
molecules may be introduced into a cell to detect RNA sequences
therein. See, e.g., Weintraub (1990) Scientific American
262:40-46.

Statistical Correlations 
In an additional embodiment, the present invention
also allows for the high resolution correlation of medical
conditions with various different markers. For example,
currently available technology, when applied to amniocentesis or
other genetic screening methods, typically screens for tens of
different markers at most. The present invention allows
simultaneous screening for tens, hundreds, thousands, tens of
thousands, hundreds of thousands, and even millions of
different genetic sequences. Thus, applying the fingerprinting
methods of the present invention to a sufficiently large
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population allows detailed statistical analysis to be made, 
thereby correlating particular medical conditions with 
particular markers, typically genetic markers or pathognomonic 
RNA expression patterns. Tumor-specific RNA expression patterns
and particular RNA species characterizing various neoplastic
phenotypes will be identified using the present invention.

Various medical conditions may be correlated against
an enormous data base of the sequences within an individual.
Genetic propensities and correlations then become available and
high resolution genetic predictability and correlation become
much more easily performed. With the enormous data base, the
reliability of the predictions also is better tested.
Particular markers which are partially diagnostic of particular
medical conditions or medical susceptibilities will be
identified and provide direction in further studies and more
careful analysis of the markers involved. Of course, as
indicated above in the sequencing embodiment, the present
invention will find much use in intense sequencing projects.
For example, sequencing of the entire human genome in the human
genome project will be greatly simplified and enabled by the
present invention.
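
A simple form of the statistical correlation between a marker and a medical condition is the odds ratio computed from a two-by-two table, sketched below in Python. The choice of the odds ratio, the addition of 0.5 to each cell (the Haldane-Anscombe correction, used here so that an empty cell does not cause division by zero), the function name, and the cohort data are all assumptions made for illustration.

```python
def odds_ratio(individuals, marker):
    """Estimate the odds ratio relating `marker` to a condition.

    `individuals` is a list of (markers, affected) pairs, where `markers`
    is the set of markers detected in that individual's fingerprint and
    `affected` is True if the individual has the condition.
    """
    with_affected = with_unaffected = without_affected = without_unaffected = 0
    for markers, affected in individuals:
        if marker in markers:
            if affected:
                with_affected += 1
            else:
                with_unaffected += 1
        elif affected:
            without_affected += 1
        else:
            without_unaffected += 1
    # 0.5 is added to each cell (Haldane-Anscombe correction).
    return ((with_affected + 0.5) * (without_unaffected + 0.5)) / (
        (with_unaffected + 0.5) * (without_affected + 0.5))


# Hypothetical cohort of eight fingerprinted individuals.
cohort = ([({"m1"}, True)] * 3 + [({"m1"}, False)]
          + [(set(), True)] + [(set(), False)] * 3)
ratio = odds_ratio(cohort, "m1")
```

A ratio well above one suggests association of the marker with the condition; with a cohort this small the estimate would carry little statistical weight, which is why a sufficiently large population is needed.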

VI. FORMATION OF SUBSTRATE 

The substrate is provided with a pattern of specific
reagents which are positionally localized on the surface of the
substrate. This matrix of positions is defined by the
automated system which produces the substrate. The instrument
will typically be one similar to that described in PCT
publication no. WO90/15070, and U.S. S.N. 07/624,120. The
instrumentation described therein is directly applicable to the
applications used here. In particular, the apparatus comprises
a substrate, typically a silicon containing substrate, on which
positions on the surface may be defined by a coordinate system
of positions. These positions can be individually addressed or
detected by the VLSIPS apparatus.

Typically, the VLSIPS apparatus uses optical methods 
used in semiconductor fabrication applications. In this way, 
masks may be used to photo-activate positions for attachment or 
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synthesis of specific sequences on the substrate. These 
manipulations may be automated by the types of apparatus 
described in PCT publication no. WO90/15070 and U.S. S.N.
07/624,120.
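
The effect of a series of masked photo-activation steps can be illustrated with a small Python simulation. It is a bookkeeping sketch only: it assumes every illuminated position couples the supplied monomer with complete efficiency, and it models none of the protecting group chemistry. The name `synthesize` and the four-position layout are hypothetical.

```python
def synthesize(positions, steps):
    """Simulate light-directed synthesis on an addressable substrate.

    `positions` lists the addressable sites. Each step is a pair
    (mask, monomer), where `mask` is the set of sites exposed to light
    in that step; only exposed sites are extended by `monomer`.
    Every site named in a mask must appear in `positions`.
    """
    grown = {position: "" for position in positions}
    for mask, monomer in steps:
        for position in mask:
            grown[position] += monomer
    return grown


# Four steps build four different dinucleotides at four sites.
steps = [
    ({0, 1}, "A"),
    ({2, 3}, "C"),
    ({0, 2}, "G"),
    ({1, 3}, "T"),
]
layout = synthesize([0, 1, 2, 3], steps)
```

Because each step addresses many sites at once, the number of distinct sequences can grow much faster than the number of steps.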

Selectively removable protecting groups allow
creation of well defined areas of substrate surface having
differing reactivities. Preferably, the protecting groups are
selectively removed from the surface by applying a specific
activator, such as electromagnetic radiation of a specific
wavelength and intensity. More preferably, the specific
activator exposes selected areas of surface to remove the
protecting groups in the exposed areas.

Protecting groups of the present invention are used
in conjunction with solid phase oligonucleotide syntheses using
deoxyribonucleic and ribonucleic acids. In addition to
protecting the substrate surface from unwanted reaction, the
protecting groups block a reactive end of the monomer to
prevent self-polymerization.

Attachment of a protecting group to the 5'-hydroxyl
group of a nucleoside during synthesis using, for example,
phosphate-triester coupling chemistry, prevents the 5'-hydroxyl
of one nucleoside from reacting with the 3'-activated
phosphate-triester of another.

Regardless of the specific use, protecting groups are
employed to protect a moiety on a molecule from reacting with
another reagent. Protecting groups of the present invention
have the following characteristics: they prevent selected
reagents from modifying the group to which they are attached;
they are stable (that is, they remain attached) to the
synthesis reaction conditions; they are removable under
conditions that do not adversely affect the remaining
structure; and once removed, they do not react appreciably with
the surface or surface-bound oligonucleotide.

In a preferred embodiment, the protecting groups will
be photoactivatable. The properties and uses of photoreactive
protecting compounds have been reviewed. See, McCray et al.,
Ann. Rev. of Biophys. and Biophys. Chem. (1989) 18:239-270,
which is incorporated herein by reference. Preferably, the
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By use of photo-lithographic optical methods, particular 
segments of the substrate can be irradiated with light to
activate or deactivate blocking agents, e.g., to protect or
deprotect particular chemical groups. By an appropriate
sequence of photo-exposure steps at appropriate times with
appropriate masks and with appropriate reagents, the substrates
can have known polymers synthesized at positionally defined
regions on the substrate. Methods for synthesizing various
substrates are described in PCT publication no. WO90/15070 and
U.S. S.N. 07/624,120. By a sequential series of these
photo-exposure and reaction manipulations, a defined matrix pattern
of known sequences may be generated, and is typically referred
to as a VLSIPS substrate. In the nucleic acid synthesis
embodiment, nucleosides used in the synthesis of DNA by
photolytic methods will typically be one of the two forms shown
below:
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B = Adenine, Cytosine, Guanine, or Thymine 
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Cytosine (C) 

[Structures: Adenine (A), Guanine (G)]

Other amides of the general formula

[Structure: general formula, with substituent R]

where R may be alkyl or aryl have been used.
Another type of protecting group, FMOC
(9-fluorenylmethoxycarbonyl), is currently being used to protect
the exocyclic amines of the three bases:

[Structures not reproduced; surviving labels: Adenine (A),
Guanine (G).]

The advantage of the FMOC group is that it is removed
under mild conditions (dilute organic bases) and can be used
for all three bases. The amide protecting groups require
harsher conditions to be removed (NH3/MeOH with heat).

Nucleosides used as 5'-OH probes, useful in verifying
correct VLSIPS synthetic function, have been the following:

The groups can be removed by mild base treatment: 0.1 N
NaOH/MeOH or K2CO3/H2O/MeOH.

Another group used most often is the silyl ether.

[Structure not reproduced.]

These groups can be removed by neutral conditions
using 1 M tetra-n-butyl ammonium fluoride in THF, or under acid
conditions.

Related to photodeprotection, the nitroveratryl group
could also be used to protect the 3'-position.

Here, light (photolysis) would be used to remove
these protecting groups.

A variety of ethers can also be used in the
protection of the 3'-O-position.
graft bead or fiber, which is covalently coated with an organic
layer (hydrophilic) terminating in hydroxyl sites (commercially
available from Molecular Biosystems, Inc.). This would offer
the same advantage as the Durapore™ membrane, allowing for
immediate phosphate linkages, but would give additional contour
by the 3-dimensional growth of oligomers.

A matrix pattern of new reagents may be targeted to
each specific oligonucleotide position by attaching a
complementary oligonucleotide to which the substrate bound form
is complementary. For instance, a number of regions may have
homogeneous oligonucleotides synthesized at various locations.
Oligonucleotide sequences complementary to each of these can be
individually generated and linked to a particular specific
reagent. Often these specific reagents will be antibodies.
As each of these is specific for finding its complementary
oligonucleotide, each of the specific reagents will bind
through the oligonucleotide to the appropriate matrix position.
In a single step, a combination of different specific
reagents, each attached to a particular oligonucleotide,
will thereby bind to its complement at the
defined matrix position. The oligonucleotides will typically
then be covalently attached, using, e.g., an acridine dye, for
photocrosslinking. Psoralen is a commonly used acridine dye
for photocrosslinking purposes, see, e.g., Song et al. (1979)
Photochem. Photobiol. 29:1177-1197; Cimino et al. (1985) Ann.
Rev. Biochem. 54:1151-1193; Parsons (1980) Photochem.
Photobiol. 32:813-821; and Dattagupta et al. (1985) U.S. Pat.
No. 4,542,102, and (1987) U.S. Pat. No. 4,713,326; each of
which is hereby incorporated herein by reference. This method
allows a single attachment manipulation to attach all of the
specific reagents to the matrix at defined positions and
results in the specific reagents being homogeneously located at
defined positions.

D. Surface Immobilization

1. Caged biotin

An alternative method of attaching reagents in a
positionally defined matrix pattern is to use a caged biotin

VIII. HYBRIDIZATION/SPECIFIC INTERACTION

A. General 

As discussed previously in the VLSIPS parent
applications, the VLSIPS substrates may be used for screening
for specific interactions with sequence specific targets or
probes.

In addition, the availability of substrates having
the entire repertoire of possible sequences of a defined length
opens up the possibility of sequencing by hybridization. This
sequencing may be de novo determination of an unknown sequence,
particularly of nucleic acid, verification of a sequence
determined by another method, or an investigation of changes in
a previously sequenced gene, locating and identifying specific
changes. For example, often Maxam and Gilbert sequencing
techniques are applied to sequences which have been determined
by Sanger and Coulson. Each of those sequencing technologies
has problems with resolving particular types of sequences.
Sequencing by hybridization may serve as a third and
independent method for verifying other sequencing techniques.
See, e.g., (1988) Science 242:1245.
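
The principle of sequencing by hybridization against a complete
repertoire of probes can be illustrated with a minimal sketch. It
is not the method of the present disclosure; it assumes idealized,
error-free hybridization, a known starting probe, and a target in
which every overlap of length k-1 occurs only once. The names
`spectrum` and `reconstruct` and the example target are
hypothetical.

```python
from itertools import product

def spectrum(target, k):
    """Set of all length-k subsequences of the target (idealized positives)."""
    return {target[i:i + k] for i in range(len(target) - k + 1)}

def reconstruct(probes, start):
    """Greedy overlap walk from a known starting probe.

    Assumes every (k-1)-base overlap has exactly one extension;
    raises ValueError when that assumption fails.
    """
    k = len(start)
    remaining = set(probes) - {start}
    seq = start
    while remaining:
        suffix = seq[-(k - 1):]
        candidates = [p for p in remaining if p.startswith(suffix)]
        if len(candidates) != 1:
            raise ValueError("ambiguous or broken overlap")
        seq += candidates[0][-1]
        remaining.remove(candidates[0])
    return seq

k = 4
# Complete repertoire of probes of length k: 4**k sequences.
matrix = {"".join(p) for p in product("ACGT", repeat=k)}
target = "ATGCAGGTC"                      # hypothetical target
positives = spectrum(target, k) & matrix  # probes that would give signal
result = reconstruct(positives, target[:k])
print(result)
```

Repeated subsequences break the uniqueness assumption, which is one
reason longer probes or additional information are needed for longer
targets.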

In addition, the ability to provide a large
repertoire of particular sequences allows use of short
subsequences and hybridization as a means to fingerprint a
polynucleotide sample. For example, fingerprinting to a high
degree of specificity of sequence matching may be used for
identifying highly similar samples, e.g., those exhibiting high
homology to the selected probes. This may provide a means for
determining classifications of particular sequences. This
should allow determination of whether particular genomes of
bacteria, phage, or even higher cells might be related to one
another.
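
As a rough illustration of comparing fingerprints, the following
sketch treats a fingerprint as the set of probes that hybridize to a
sample and scores two samples by the fraction of positives they
share (the Jaccard index). The scoring rule, the probe panel, and the
function names are assumptions made for illustration only;
hybridization is idealized as exact substring matching.

```python
def fingerprint(sample, probes):
    """Probes found in the sample (idealized, exact-match hybridization)."""
    return {p for p in probes if p in sample}

def similarity(fp_a, fp_b):
    """Jaccard index: shared positives divided by all positives."""
    union = fp_a | fp_b
    return len(fp_a & fp_b) / len(union) if union else 1.0

probes = ["ACG", "CGT", "GTA", "TAC", "GGG", "TTT"]  # hypothetical panel
a = fingerprint("ACGTAC", probes)
b = fingerprint("ACGTAG", probes)   # differs from a in the last base
c = fingerprint("GGGTTT", probes)   # unrelated sample

print(similarity(a, b), similarity(a, c))
```

Closely related samples score near 1 and unrelated samples near 0;
a classification threshold would have to be chosen empirically.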

In addition, fingerprinting may be used to identify
an individual source of biological sample. See, e.g., Lander,
E. (1989) Nature 339:501-505, and references therein. For
example, a DNA fingerprint may be used to determine whether a
genetic sample arose from another individual. This would be
particularly useful in various sorts of forensic tests to
determine, e.g., paternity or sources of blood samples.
Significant detail on the particulars of genetic fingerprinting
for identification purposes is described in, e.g., Morris et
al. (1989) "Biostatistical evaluation of evidence from
continuous allele frequency distribution DNA probes in
reference to disputed paternity and identity," J. Forensic
Science 34:1311-1317; and Neufeld et al. (1990) Scientific
American 262:46-53; each of which is hereby incorporated herein
by reference.

In another embodiment, a fingerprinting-like
procedure may be used for classifying cell types by analyzing a
pattern of specific nucleic acids present in the cell,
specifically RNA expression patterns. This may also be useful
in defining the temporal stage of development of cells, e.g.,
stem cells or other cells which undergo temporal changes in
development. For example, the stage of a cell, or group of
cells, may be tested or defined by isolating a sample of mRNA
from the population and testing to see what sequences are
present in messenger populations. Direct samples, or amplified
samples (e.g., by polymerase chain reaction), may be used.
Where particular mRNA or other nucleic acid sequences are
characteristic of, or shown to be characteristic of, particular
developmental stages, physiological states, or other
conditions, this fingerprinting method may define them.

The present invention may also be used for mapping
sequences within a larger segment. This may be performed by at
least two methods, particularly in reference to nucleic acids.
Often, enormous segments of DNA are subcloned into a large
plurality of subsequences. Ordering these subsequences may be
important in determining the overlaps of sequences upon
nucleotide determinations. Mapping may be performed by
immobilizing particularly large segments onto a matrix using
the VLSIPS technology. Alternatively, sequences may be ordered
by virtue of subsequences shared by overlapping segments. See,
e.g., Craig et al. (1990) Nuc. Acids Res. 18:2653-2660;
Michiels et al. (1987) CABIOS 3:203-210; and Olson et al.
(1986) Proc. Natl. Acad. Sci. USA 83:7826-7830.
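
Ordering segments by the subsequences they share can be sketched as
follows. This is a simplified greedy heuristic chosen for
illustration, not the procedure of the cited references; the clone
names, probe names, and `order_clones` function are hypothetical, and
the data are assumed to be free of false positives.

```python
def order_clones(hits):
    """Order clones so that neighbours share the most probe hits.

    hits maps clone name -> set of probes that hybridize to it.
    Starts from the clone with the least total sharing (a likely end
    of the contig) and repeatedly appends the unplaced clone that
    shares the most probes with the last placed clone.
    """
    def shared(x, y):
        return len(hits[x] & hits[y])

    names = list(hits)
    start = min(names, key=lambda n: sum(shared(n, m) for m in names if m != n))
    order, left = [start], set(names) - {start}
    while left:
        nxt = max(sorted(left), key=lambda n: shared(order[-1], n))
        order.append(nxt)
        left.remove(nxt)
    return order

hits = {
    "cloneC": {"p5", "p6", "p7"},
    "cloneA": {"p1", "p2", "p3"},
    "cloneD": {"p7", "p8"},
    "cloneB": {"p3", "p4", "p5"},
}
ordering = order_clones(hits)
print(ordering)
```

Real mapping data contain repeats and noise, so a practical
implementation would need a more robust criterion than a single
greedy pass.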
optimized for the system in question. Dextran sulfate is often
included at a concentration of between 0.5% and 2% by weight or
dextran sulfate at a concentration between about 0.5% and 5%.
Alternatively, proteins which accelerate hybridization may be
added, e.g., the recA protein found in E. coli, or other
homologous proteins.

Of course, the specific hybridization conditions will
be selected to correspond to a discriminatory condition which
provides a positive signal where desired but fails to show a
positive signal at affinities where interaction is not desired.
This may be determined by a number of titration steps or with a
number of controls which will be run during the hybridization
and/or washing steps to determine at what point the
hybridization conditions have reached the stage of desired
specificity.
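
The use of controls to find the point of desired specificity can be
sketched as a simple search over successive washing steps. The
control signals, the threshold of 500 arbitrary units, and the
function name are hypothetical values chosen only to show the logic.

```python
def first_discriminating_step(match, mismatch, threshold):
    """Index of the first step at which every matched control is above
    the threshold and every mismatched control is below it.
    Returns None if no step satisfies both conditions."""
    for step, (m, mm) in enumerate(zip(match, mismatch)):
        if min(m) > threshold and max(mm) < threshold:
            return step
    return None

# Hypothetical control signals (arbitrary units) after four successive
# washes of increasing stringency.
match = [[980, 950], [900, 870], [820, 790], [300, 250]]
mismatch = [[900, 850], [600, 520], [120, 90], [40, 30]]

step = first_discriminating_step(match, mismatch, threshold=500)
print(step)
```

In this made-up series the third wash (index 2) is the first at which
matched and mismatched controls are separated; a fourth wash removes
the desired signal as well.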

IX. DETECTION METHODS

Methods for detection depend upon the label selected.
The criteria for selecting an appropriate label are discussed
below; however, a fluorescent label is preferred because of its
extreme sensitivity and simplicity. Standard labeling
procedures are used to determine the positions where
interactions between a sequence and a reagent take place. For
example, if a target sequence is labeled and exposed to a
matrix of different probes, only those locations where probes
do interact with the target will exhibit any signal.
Alternatively, other methods may be used to scan the matrix to
determine where interaction takes place. Of course, the
spectrum of interactions may be determined in a temporal manner
by repeated scans of interactions which occur at each of a
multiplicity of conditions. However, instead of testing each
individual interaction separately, a multiplicity of sequence
interactions may be simultaneously determined on a matrix.

A. Labeling Techniques

The target polynucleotide may be labeled by any of a
number of convenient detectable markers. A fluorescent label
is preferred because it provides a very strong signal with low
background. It is also optically detectable at high resolution
and sensitivity through a quick scanning procedure. Other
potential labeling moieties include radioisotopes,
chemiluminescent compounds, labeled binding proteins, heavy
metal atoms, spectroscopic markers, magnetic labels, and linked
enzymes.

Another method for labeling does not require
incorporation of a labeling moiety. The target may be exposed
to the probes, and a double strand hybrid is formed at those
positions only. Addition of a double strand specific reagent
will detect where hybridization takes place. An intercalative
dye such as ethidium bromide may be used as long as the probes
themselves do not fold back on themselves to a significant
extent forming hairpin loops. See, e.g., Sheldon et al. (1986)
U.S. Pat. No. 4,582,789. However, the length of the hairpin
loops in short oligonucleotide probes would typically be
insufficient to form a stable duplex.

In another embodiment, different targets may be
simultaneously sequenced where each target has a different
label. For instance, one target could have a green fluorescent
label and a second target could have a red fluorescent label.
The scanning step will distinguish sites of binding of the red
label from those binding the green fluorescent label. Each
sequence can then be analyzed independently of the other.
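
The separation of two differently labeled targets after a single
scan can be sketched as below. The scan data, the intensity
threshold, and the function name are hypothetical, and the sketch
assumes the two emission channels do not bleed into each other.

```python
def split_channels(scan, threshold):
    """Return (green_sites, red_sites): the matrix positions whose
    intensity exceeds the threshold in each channel.

    scan maps (row, column) -> (green_intensity, red_intensity).
    """
    green = {pos for pos, (g, _) in scan.items() if g > threshold}
    red = {pos for pos, (_, r) in scan.items() if r > threshold}
    return green, red

# Hypothetical two-channel scan of a 2 x 2 matrix (arbitrary units).
scan = {
    (0, 0): (900, 10),    # green-labeled target only
    (0, 1): (15, 850),    # red-labeled target only
    (1, 0): (20, 12),     # no binding
    (1, 1): (700, 640),   # both targets bind this probe
}
green_sites, red_sites = split_channels(scan, threshold=100)
print(sorted(green_sites), sorted(red_sites))
```

Each set of positions can then be passed independently to whatever
sequence analysis is applied to a single-label experiment.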

Suitable chromogens will include molecules and
compounds which absorb light in a distinctive range of
wavelengths so that a color may be observed, or emit light when
irradiated with radiation of a particular wavelength or
wavelength range, e.g., fluorescers.

A wide variety of suitable dyes are available, being
primarily chosen to provide an intense color with minimal
absorption by their surroundings. Illustrative dye types
include quinoline dyes, triarylmethane dyes, acridine dyes,
alizarine dyes, phthaleins, insect dyes, azo dyes,
anthraquinoid dyes, cyanine dyes, phenazathionium dyes, and
phenazoxonium dyes.

A wide variety of fluorescers may be employed either
by themselves or in conjunction with quencher molecules.
Fluorescers of interest fall into a variety of categories
having certain primary functionalities. These primary
functionalities include 1- and 2-aminonaphthalene,
p,p'-diaminostilbenes, pyrenes, quaternary phenanthridine salts,
9-aminoacridines, p,p'-diaminobenzophenone imines, anthracenes,
oxacarbocyanine, merocyanine, 3-aminoequilenin, perylene,
bis-benzoxazole, bis-p-oxazolyl benzene, 1,2-benzophenazin,
retinol, bis-3-aminopyridinium salts, hellebrigenin,
tetracycline, sterophenol, benzimidazolylphenylamine,
2-oxo-3-chromen, indole, xanthen, 7-hydroxycoumarin, phenoxazine,
salicylate, strophanthidin, porphyrins, triarylmethanes and
flavin. Individual fluorescent compounds which have
functionalities for linking or which can be modified to
incorporate such functionalities include, e.g., dansyl
chloride; fluoresceins such as
3,6-dihydroxy-9-phenylxanthhydrol; rhodamineisothiocyanate;
N-phenyl 1-amino-8-sulfonatonaphthalene;
N-phenyl 2-amino-6-sulfonatonaphthalene;
4-acetamido-4-isothiocyanato-stilbene-2,2'-disulfonic acid;
pyrene-3-sulfonic acid; 2-toluidinonaphthalene-6-sulfonate;
N-phenyl, N-methyl 2-aminonaphthalene-6-sulfonate; ethidium
bromide; stebrine; auromine-0,2-(9'-anthroyl)palmitate; dansyl
phosphatidylethanolamine; N,N'-dioctadecyl oxacarbocyanine;
N,N'-dihexyl oxacarbocyanine; merocyanine,
4-(3'pyrenyl)butyrate; d-3-aminodesoxy-equilenin;
12-(9'-anthroyl)stearate; 2-methylanthracene; 9-vinylanthracene;
2,2'(vinylene-p-phenylene)bisbenzoxazole;
p-bis[2-(4-methyl-5-phenyl-oxazolyl)]benzene;
6-dimethylamino-1,2-benzophenazin; retinol;
bis(3'-aminopyridinium) 1,10-decandiyl diiodide;
sulfonaphthylhydrazone of hellibrienin; chlorotetracycline;
N-(7-dimethylamino-4-methyl-2-oxo-3-chromenyl)maleimide;
N-[p-(2-benzimidazolyl)-phenyl]maleimide;
N-(4-fluoranthyl)maleimide; bis(homovanillic acid); resazarin;
4-chloro-7-nitro-2,1,3-benzooxadiazole; merocyanine 540;
resorufin; rose bengal; and 2,4-diphenyl-3(2H)-furanone.

Desirably, fluorescers should absorb light above
about 300 nm, preferably about 350 nm, and more preferably
above about 400 nm, usually emitting at wavelengths greater
than about 10 nm higher than the wavelength of the light

B. Scanning system

With the automated detection apparatus, the
correlation of specific positional labeling is converted to the
presence on the target of sequences for which the reagents have
specificity of interaction. Thus, the positional information
is directly converted to a database indicating what sequence
interactions have occurred. For example, in a nucleic acid
hybridization application, the sequences which have interacted
between the substrate matrix and the target molecule can be
directly listed from the positional information. The detection
system used is described in PCT publication no. WO90/15070; and
U.S. S.N. 07/624,120. Although the detection described therein
is a fluorescence detector, the detector may be replaced by a
spectroscopic or other detector. The scanning system may make
use of a moving detector relative to a fixed substrate, a fixed
detector with a moving substrate, or a combination.
Alternatively, mirrors or other apparatus can be used to
transfer the signal directly to the detector. See, e.g.,
U.S. S.N. 07/624,120, which is hereby incorporated herein by
reference.

The detection method will typically also incorporate
some signal processing to determine whether the signal at a
particular matrix position is a true positive or may be a
spurious signal. For example, a signal from a region which has
actual positive signal may tend to spread over and provide a
positive signal in an adjacent region which actually should not
have one. This may occur, e.g., where the scanning system is
not properly discriminating with sufficiently high resolution
in its pixel density to separate the two regions. Thus, the
signal over the spatial region may be evaluated pixel by pixel
to determine the locations and the actual extent of positive
signal. A true positive signal should, in theory, show a
uniform signal at each pixel location. Thus, a plot of the
number of pixels against actual signal intensity should
show a clearly uniform signal intensity. Regions where the
signal intensities show a fairly wide dispersion may be
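
The pixel-by-pixel evaluation described above can be sketched with a
simple dispersion measure. The use of the coefficient of variation,
the cut-off values, and the pixel data are assumptions for
illustration only and are not taken from the detection system
referenced above.

```python
from statistics import mean, pstdev

def classify_region(pixels, min_mean, max_cv):
    """Classify one matrix region from its pixel intensities.

    "negative": mean intensity below min_mean.
    "positive": bright and uniform (coefficient of variation <= max_cv).
    "suspect":  bright on average but widely dispersed, e.g. signal
                spreading in from an adjacent region.
    """
    m = mean(pixels)
    if m < min_mean:
        return "negative"
    cv = pstdev(pixels) / m
    return "positive" if cv <= max_cv else "suspect"

# Hypothetical pixel intensities (arbitrary units) for three regions.
uniform = [100, 102, 98, 101]
bleed = [100, 60, 20, 5]
dark = [3, 4, 2, 5]

labels = [classify_region(r, min_mean=30, max_cv=0.2)
          for r in (uniform, bleed, dark)]
print(labels)
```

A region flagged as "suspect" would be a candidate for rescanning at
higher resolution or for exclusion from the list of positives.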

maintain the quality and integrity of oligonucleotides. These
include storing the substrate in a carefully controlled
environment under conditions of lower temperature, cation
depletion (EDTA and EGTA), sterile conditions, and inert argon
or nitrogen atmosphere.

XII. INTEGRATED SEQUENCING STRATEGY 

A. Initial Mapping Strategy 

As indicated above, although the VLSIPS may be
applied to sequencing embodiments, it is often useful to
integrate other concepts to simplify the sequencing. For
example, nucleic acids may be easily sequenced by careful
selection of the vectors and hosts used for amplifying and
generating the specific target sequences. For example, it may
be desired to use specific vectors which have been designed to
interact most efficiently with the VLSIPS substrate. This is
also important in fingerprinting and mapping strategies. For
example, vectors may be carefully selected having particular
complementary sequences which are designed to attach to a
genetic or specific oligomer on the substrate. This is also
applicable to situations where it is desired to target
particular sequences to specific locations on the matrix.

In one embodiment, unnatural oligomers may be used to
target natural probes to specific locations on the VLSIPS
substrate. In addition, particular probes may be generated for
the mapping embodiment which are designed to have specific
combinations of characteristics. For example, the construction
of a mapping substrate may depend upon use of another automated
apparatus which takes clones isolated from a chromosome walk
and attaches them individually or in bulk to the VLSIPS
substrate.

In another embodiment, a variety of specific vectors
having known and particular "targeting" sequences adjacent the
cloning sites may be individually used to clone a selected
probe, and the isolated probe will then be targetable to a site
on the VLSIPS substrate with a sequence complementary to the
"target" sequence.

case, the sequencing capability will greatly [...]
selection of appropriate sequences to be used as p[...].

Also as indicated above, various means for
constructing an appropriate substrate [...] mechanical or
automated [...] oligonucleotides [...] automated procedure
involves [...]. In various other [...] short polymers [...]
synthesized [...] matrix in an ordered [...] site specifically
directs collections of reagents to specific locations using
unnatural nucleotides or equivalent sorts of targeting
molecules.

While a brute force manual transfer process may be
utilized, sequentially attaching various samples to successive
positions, instrumentation for automating such procedures may
also be devised. The automated system for performing such
procedures would preferably be relatively easily designed and
conceptually easily understood.

XIII. COMMERCIAL APPLICATIONS

A. Sequencing

As indicated above, sequencing may be performed
either de novo or as a verification of another sequencing
method. The present hybridization technology provides the
ability to sequence nucleic acids and polynucleotides de novo,
or as a means to verify either the Maxam and Gilbert chemical
sequencing technique or Sanger and Coulson dideoxy-sequencing
techniques. The hybridization method is useful to verify
sequencing determined by any other sequencing technique and to
closely compare two similar sequences, e.g., to identify and
locate sequence differences.

Of course, sequencing can be very important in
many different sorts of environments. For example, it will be
useful in determining the genetic sequence of particular
markers in various individuals. In addition, polymers may be
used as markers or as information containing molecules to
encode information. For example, a short polynucleotide
sequence may be included in large bulk production samples
indicating the manufacturer, date, and location of manufacture
of a product. For example, various drugs may be encoded with
this information with a small number of molecules in a batch.
For example, a pill may have somewhere from 10 to 100 to 1,000
or more very short and small molecules encoding this
information. When necessary, this information may be decoded
from a sample of the material using a polymerase chain reaction
(PCR) or other amplification method. This encoding system may
be used to provide the origin of large bulky samples without
significantly affecting the properties of those samples. For
example, chemical samples may also be encoded by this method,
thereby providing [...] the source and manufacturing details of
lots. [...] origin of bulk hydrocarbon samples may be encoded.
Prod[...] of [...] compounds such as benzene or [...] molecule
polymer. Foodstuffs [...] also be encoded using similar [...].
[...] samples can be encoded determining the s[...] origin. In
this way, proper disposal can be [...] enforced.

Similar sorts of encoding may be provided by
fingerprinting-type analysis. [...] resolution is absolute or
less so, the concept of [...] information on molecules such as
nucleic acid, [...] and later decoded, may [...] application.

This technology also provides the ability to include
markers for origins [...]. For example, a [...] animal line may
be transformed with a particular [...] origin. With a selection
of multiple [...] negligible that a [...] would have
independently arisen from a source [...] specifically protected
source. This technique may provide a means for tracing the
actual [...] particular biological materials. Bacteria, [...]
will be [...] marking by such encoding segments.
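
The notion of encoding source or batch information in a short
polynucleotide can be sketched as follows. The two-bits-per-base
mapping (A=00, C=01, G=10, T=11), the function names, and the batch
label are all hypothetical; nothing here is specified by the text
above, which does not prescribe a coding scheme.

```python
BASES = "ACGT"  # assumed mapping: A=00, C=01, G=10, T=11

def encode(text):
    """Encode ASCII text as bases, four bases per character,
    most significant bit pair first."""
    out = []
    for byte in text.encode("ascii"):
        for shift in (6, 4, 2, 0):
            out.append(BASES[(byte >> shift) & 3])
    return "".join(out)

def decode(seq):
    """Inverse of encode(); seq length must be a multiple of four."""
    data = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for base in seq[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)
        data.append(byte)
    return data.decode("ascii")

tag = encode("LOT42")   # hypothetical batch label
print(tag, decode(tag))
```

A practical scheme would also need to avoid sequences that are
difficult to synthesize or amplify and to tolerate read errors,
which this sketch ignores.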

B. Fingerprinting

As indicated above, the fingerprinting technology may
also be used for data [...]. [...] fingerprinting [...] of
particular individuals. Where fingerprinting technology is
standardized, and used for identification of large numbers of
people, related equipment and peripheral processing will be
developed to accompany [...]. For example, specific equipment
may [...] automatically taking a biological sample and
generating [...] the information [...] fingerprinting [...].
[...] fingerprinting substrate may be mass
produced using particular types of automatic equipment.
Synthetic equipment may produce the entire matrix
simultaneously by stepwise synthetic methods as provided by the
VLSIPS technology. The attachment of specific probes onto a
substrate may also be automated, e.g., making use of the caged
biotin technology. See, e.g., U.S. S.N. 07/612,671 (caged
biotin CIP).

In addition, peripheral processing may be important
and may be dedicated to this specific application. Thus,
automated equipment for producing the substrates may be
designed, or particular systems which take in a biological
sample and output either a computer readout or an encoded
instrument, e.g., a card or document which indicates the
information and can provide that information to others. An
identification card having a short magnetic strip with a few
million bits may be used to provide individual identification
and important medical information useful in a medical emergency.

In fact, data banks may be set up to correlate all of
this fingerprinting information with medical information.
This may allow for the determination of correlations between
various medical problems and specific DNA sequences. By
collating large populations of medical records with genetic
information, genetic propensities and genetic susceptibilities
to particular medical conditions may be developed. Moreover,
with standardization of substrates, the micro encoding data may
also be standardized to reproduce the information from a
centralized data bank or on an encoding device carried on an
individual person. On the other hand, if the fingerprinting
procedure is sufficiently quick and routine, every hospital may
routinely perform a fingerprinting operation and from that
determine many important medical parameters for an individual.

In particular industries, the VLSIPS sequencing,
fingerprinting, or mapping technology will be particularly
appropriate. As mentioned above, agricultural livestock
suppliers may be able to encode and determine whether their
particular strains are being used by others. By incorporating
particular markers into their genetic stocks, the markers will
indicate origin of genetic material. This is applicable to
seed producers, livestock producers, and other suppliers of
medical or agricultural biological materials.

This may also be useful in identifying individual
animals or plants. For example, these markers may be useful in
determining whether certain fish return to their original
breeding grounds, whether sea turtles always return to their
original birthplaces, or to determine the migration patterns
and viability of populations of particular endangered species.
It would also provide means for tracking the sources of
particular animal products. For example, it might be useful
for determining the origins of controlled animal substances
such as elephant ivory or particular bird populations whose
importation or exportation is controlled.

As indicated above, polymers may be used to encode
important information on source and batch and supplier. This
is described in greater detail in, e.g., "Applications of PCR to
industrial problems," (1990) in Chemical and Engineering News
68:145, which is hereby incorporated herein by reference. In
fact, the synthetic method can be applied to the storage of
enormous amounts of information. Small substrates may encode
enormous amounts of information, and its recovery will make use
of the inherent replication capacity. For example, with regions
of 10 µm x 10 µm, 1 cm^2 has 10^6 regions. In theory, the entire
human genome could be attached in 1000 nucleotide segments on a
3 cm^2 surface. Genomes of endangered species may be stored on
these substrates.
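
The density figures in the preceding paragraph can be checked with a
short calculation. The variable names are illustrative; the only
outside fact used is that the haploid human genome is commonly cited
as about 3 x 10^9 base pairs.

```python
region_side_um = 10                    # each region is 10 µm x 10 µm
um_per_cm = 10_000                     # 1 cm = 10,000 µm

regions_per_cm2 = (um_per_cm // region_side_um) ** 2
segment_length_nt = 1000               # nucleotides attached per region
area_cm2 = 3

capacity_nt = regions_per_cm2 * segment_length_nt * area_cm2
print(regions_per_cm2, capacity_nt)
```

The result, 10^6 regions per cm^2 and 3 x 10^9 nucleotides on 3 cm^2,
agrees with the figures stated above.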

Fingerprinting may also be used for genetic tracing
or for identifying individuals for forensic science purposes.
See, e.g., Morris, J. et al. (1989) "Biostatistical Evaluation
of Evidence From Continuous Allele Frequency Distribution DNA
Probes in Reference to Disputed Paternity and Identity," J.
Forensic Science 34:1311-1317, and references provided therein;
each of which is hereby incorporated herein by reference.

In addition, high resolution fingerprinting
allows particular samples to be distinguished to high
resolution. As indicated above, new cell classifications may be
defined based on combinations of a large number of properties.
Similar applications will be found in distinguishing different
species of animals or plants. In fact, microbial
identification may become dependent on characterization of the
genetic content. Tumors or other cells exhibiting abnormal
physiology will be detectable by use of the present invention.
Also, knowing the genetic fingerprint of a microorganism may
provide very useful information on how to treat an infection by
such organism.

Modifications of the fingerprint embodiments may be
used to diagnose the condition of the organism. For example, a
blood sample is presently used for diagnosing any of a number
of different physiological conditions. A multi-dimensional
fingerprinting method made available by the present invention
could become a routine means for diagnosing an enormous number
of physiological features simultaneously. This may
revolutionize the practice of medicine in providing information
on an enormous number of parameters together at one time. In
another way, knowledge of genetic predisposition may also
revolutionize the practice of medicine, providing a physician
with the ability to predict the likelihood of particular medical
conditions arising at any particular moment. It also provides
the ability to apply preventative medicine.

Also available are kits with the reagents useful for
performing sequencing, fingerprinting, and mapping procedures.
The kits will have various compartments with the desired
necessary reagents, e.g., substrate, labeling reagents for
target samples, buffers, and other useful accompanying
products.

C. Mapping 

The present invention also provides the means for
mapping sequences within enormous stretches of sequence. For
example, nucleotide sequences may be mapped within enormous
chromosome size sequence maps. For example, it would be
possible to map a chromosomal location within the chromosome
which contains hundreds of millions of nucleotide base pairs.
In addition, the mapping and fingerprinting embodiments allow
for testing of chromosomal translocations, one of the standard
problems for which amniocentesis is performed.

The present invention will be better understood by
reference to the following illustrative examples. The
following examples are offered by way of illustration and not
by way of limitation.

Relevant applications whose techniques are
incorporated herein by reference [...] WO90/15070, [...];
WO91/07087, published May 30, 199[...]; [...] 07/[...],120,
filed December 6, 1990; and U.S.S.N. 07/626,730, [...]
December 6, 199[...].

Also, additional relevant techniques are described,
e.g., in Sambrook, J., et al. (1989) Molecular Cloning: A
Laboratory Manual, 2d Ed., vols [...] Spring Harbor Press, New
York; Greenstein and Winitz [...] Springer-Verlag [...];
Bishop and Rawlings (1987) [...] a practical approach, IRL
Press, Oxford; Hames and [...] Wiley-Interscience [...] which
is hereby incorporated herein by reference.

EXAMPLES

The following examples are provided to illustrate the
efficacy of the inventions herein. All operations were
conducted at about ambient temperatures and pressures unless
indicated to the contrary.

POLYNUCLEOTIDE SEQUENCING 

1. HPLC of the photolysis of 5'-O-nitroveratryl-thymidine.

In order to determine the time for photolysis of
5'-O-nitroveratryl thymidine to thymidine, a 100 µM solution of
NVO-Thym-OH (5'-O-nitroveratryl thymidine) in dioxane was made
and 200 µl aliquots were irradiated (in a quartz cuvette 1 cm x 2
mm) at 362.3 nm for 20 sec, 40 sec, 60 sec, 2 min, 5 min, 10
min, 15 min, and 20 min. The resulting irradiated mixtures
were then analyzed by HPLC using a Varian MicroPak SP column
(C analytical) at a flow rate of 1 ml/min and a solvent system
of 40% CH3CN and 60% water. Thymidine has a retention time of
1.2 min and NVO-Thym-OH has a retention time of 2.1 min. It
was seen that after 10 min of exposure the deprotection was
complete.

2. Preparation and Detection of Thymidine-Cytidine dimer (FITC)

The reaction is illustrated:
To an aminopropylated glass slide (standard VLSIPS)
was added a mixture of the following:

12.2 mg of NVO-Thym-COH (IX) 
3 4 mg of HOBT (N-hydroxybenztriazal) 
sis n DIEA (Diisopropylethylamlne) 
5 11.1 mg BOP reagent 

21 mg of tetrazole 
1 ml anhydrous CH3CN
M ter Oeino treated « ^l"^^^ 
sli de .as wasned o« ^'^J — • 
T /H 2 0/THF/lutiain. for 1 -m. The ^ ^ ^ 

dried, and treated for 30 mn wrth a 20 eJt p=sad 

J0 „«r. After ^ C SI is tnUanate in 

*.„ a' fttc solution (lmM fluorescein a . ,,»,„ 

to a FlTt- soiu riT . ie d and examined by 

DMF , for 5. »in, alon i. iliustrated: 

fluorescence microscopy. Tnis r 
3. Preparation and Detection of Thymidine-
Cytidine dimer (Biotin)

An aminopropyl glass slide was soaked in a solution
of ethylene oxide (20% in DMF) to generate a hydroxylated
surface. To the slide was added a mixture of the following:

32 mg of NVO-T-OCED (X)
11 mg of tetrazole
0.5 ml of anhydrous CH3CN

After 8 min the plate was rinsed with
acetonitrile, then oxidized with I2/H2O/THF/lutidine for 1 min,
washed and dried. The slide was then exposed to a 1:3 mixture
of acetic anhydride:pyridine for 1 h, then washed and dried.
The substrate was then photolyzed in dioxane at 362 nm at 14
mW/cm2 for 10 min using a 500 µm checkerboard mask, dried, and
then treated with a mixture of the following:

65 mg of biotin modified C (IV)
11 mg of tetrazole
0.5 ml anhydrous CH3CN

After 8 min the slide was washed with CH3CN, then
oxidized with I2/H2O/THF/lutidine for 1 min, washed, and then
dried. The slide was then soaked for 30 min in a PBS/0.05%
Tween 20 buffer and the solution then shaken off. The slide
was next treated with FITC-labeled streptavidin at 10 µg/ml in
the same buffer system for 30 min. After this time the
streptavidin-buffer system was rinsed off with fresh PBS/0.05%
Tween 20 buffer and then the slide was finally agitated in
distilled water for about 1/2 h. After drying, the slide was
examined by fluorescence microscopy (see Fig. 2 and Fig. 3).

4. substrate preparation

Before attachment of reactive groups it is preferred
to clean the substrate which is, in a preferred embodiment, a
glass substrate such as a microscope slide or cover slip. A
roughened surface will be useable but a plastic or other solid
substrate is also appropriate. According to one embodiment the
slide is soaked in an alkaline bath consisting of, e.g.,
1 liter of 95% ethanol with 120 ml of water and 120 grams of
sodium hydroxide for 12 hours. The slides are washed with a
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, er allowed to air dry, and rS^d 
„„ under running water, allow 
buffer and under 

with a solution ot aminated with, e.g. , 

The slides are then ^ of atta ching ammo 

.onvltriethoxysilane for the purpos _ other 

5 — ace ° n rirr^d for this 

5 functional^ f^ 1 ^ 

. purpose. ^-^r^concentraticnsfr-lOlto 

aeinopropyltriethoxysilane. T ^ ^ approprxate 

rf-t t-^— TO "^Tlinut... "0 A of th * -f« S 

dipping in 100* ethanol. _ ^ heatea in a U.0-U0-C 

diPP After the ^ all o«ea to cure at 

20 vaeuu. oven for ahout » - «- -ironeent. 
ro o* tenperature for about »J» ylfOT ^ae, 

chloride. - • 

4- mocking of free sites 
5 . linker. ^ J^n exposed to 

The abated surface^ ^ ^ ^ solu tion of 
for example, 3 ° in DMF for 
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..voc-nuoleotide- HHS (Hydroxy of th . amino groups 

attachment of a KVOC-nuc eetxd. ^ nucleotide 

derivatives. 
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Cached,. re now capped ^ o£ ao etic anhydride in 
taction, hv exposure whic h eav perform thrs 

pyridine for 1 hour, other 
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• n,rt e trifluoroacetic anhydridlT 
residual capping reactiv e acylating agents. 

r^r- sk— — with dmf ' methylene 

chloride, and ethanol. 



- 4 ulustr r s =e: * 

trlM rs cf the *»2T^ t li« . » ,lass elide be.rin, • 
(represent.^ by C and T. re p oW10 xycarWxa»ld. 
, silane ,-ups as , substrate. Active esters 

(BV0C . OT ) residues is prepared ^ ^ 

(pentafluorophenyl, OBt, ««•> ^ prep , r ed as 

ro«c« ar the .• hydro*,! ,roup wi ^ ^ chai 

reacts. »U- - percent to ^ ^ must 

- ir^ttr^rarr;:— » £ ^ used « 

protect the primary chain. £ cycles 

For a »onomer set of si* n & ^ . ^ ft . 

are required to synthesize all possibl 

20 A cycle consists ° f < an appropriate mask 

l#; I-adiation t hr g ^ ^ where 

to expose the 5 9 w . th apprQ _ 

the next residue is to bv _ productS of the 

priate washes to remove the by pro 

deprotection. . ■ and protected 

25 single activated ana pro 

2 . Addition of a singi eTnovab le group) 

(w ith the same photochemically 

corner, which will react only - - ^ ^ 
addressed -step wit, « ^ ^ 
remove the excess re y - em b e r of the 

Th e above cycle i. .repeated — _ 

• »ononer set until each ^ „ other 

extended by one residue in one ? aaaed at one 

e^aiments. several residues ™£T cycle ti*es 

35 location before ^ - - - ^ reaction rate, no. as 
will generally be limited W ollgonU cleotide 
short as about 10 »in - --rllly^olloved by addition o, 
synthesizers, wis 
a protecting group to stabilize the array for later testing.
For some types of polymers (e.g., peptides), a final
deprotection of the entire surface (removal of photoprotective
side chain groups) may be required.

More particularly, as shown in Fig. 4A, the glass 20
is provided with regions 22, 24, 26, 28, 30, 32, 34, and 36.
Regions 30, 32, 34, and 36 are masked, as indicated by the hatched
regions in Fig. 4B, and the glass is irradiated, as indicated by the
bright regions 22, 24, 26, and 28, and exposed to a reagent
containing a photosensitive blocked C (e.g., cytosine
derivative), with the resulting structure shown in Fig. 4C.
The substrate is carefully washed and the reactants removed.
Thereafter, regions 22, 24, 26, and 28 are masked, as indicated
by the hatched region, the glass is irradiated (as shown
in Fig. 4D), as indicated by the bright regions, at 30, 32, 34,
and 36, and exposed to a photosensitive blocked reagent
containing T (e.g., thymine derivative), with the resulting
structure shown in Fig. 4E. The process proceeds,
consecutively masking and exposing the sections as shown until
the structure shown in Fig. 4M is obtained. The glass is
irradiated and the terminal groups are, optionally, capped by
acetylation. As shown, all possible trimers of
cytosine/thymine are obtained.
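
The masking sequence of Fig. 4 can be pictured with a short simulation. This is only a sketch under stated assumptions: the function name `masked_synthesis`, the numbering of regions from 0, and the mask layout used for the second and third layers are hypothetical choices made for illustration. Only the first layer (C coupled to the first four regions, T to the last four) follows the text.

```python
from itertools import product


def masked_synthesis(monomers, length):
    """Simulate light-directed synthesis on n**length regions.

    Each (layer, monomer) pair stands for one cycle: a mask exposes a
    subset of regions and the monomer couples only where light fell.
    The base-n mask layout below is an assumed, illustrative pattern.
    """
    n = len(monomers)
    regions = [""] * (n ** length)
    for layer in range(length):
        # Width of the blocks of adjacent regions exposed together.
        block = n ** (length - 1 - layer)
        for index, monomer in enumerate(monomers):
            for r in range(len(regions)):
                exposed = (r // block) % n == index
                if exposed:
                    regions[r] += monomer
    return regions


trimers = masked_synthesis("CT", 3)
print(trimers)

# Every possible C/T trimer appears exactly once across the 8 regions.
assert sorted(trimers) == sorted("".join(p) for p in product("CT", repeat=3))
```

With two monomers and three layers the simulation runs six mask-and-couple cycles and ends with eight regions, one per trimer.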

In this example, no side chain protective
group removal is necessary, as might be common in modified
nucleotides. If it is desired, side chain deprotection may be
accomplished by treatment with ethanedithiol and trifluoro-
acetic acid.

In general, the number of steps needed to obtain a
particular polymer chain is defined by:

$n \times \ell$ (1)

where:

$n$ = the number of monomers in the basis set of
monomers, and

$\ell$ = the number of monomer units in a polymer chain.

Conversely, the synthesized number of sequences of
length $\ell$ will be:

$n^{\ell}$ (2)
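
As a quick check on expressions (1) and (2), the two counts can be computed directly. The helper name `synthesis_budget` is an assumption introduced here for illustration and does not appear in the specification.

```python
def synthesis_budget(n, length):
    """Return (steps, sequences) for n monomers and chains of `length` units."""
    steps = n * length        # expression (1): one step per monomer per layer
    sequences = n ** length   # expression (2): every ordered choice of monomers
    return steps, sequences


print(synthesis_budget(2, 3))    # C/T trimers
print(synthesis_budget(4, 10))   # A/C/G/T 10-mers
```

For the C/T trimer example this gives 6 steps for 8 sequences; for 10-mers of four nucleotides it gives 40 steps for 1,048,576 sequences, which shows how slowly the step count grows relative to the number of sequences.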



10 



PCT/IJ^09226 . 

WO 92/10588 95 ^0 

,. vprgitv is obtained by using 
« course, the synthesis of 

nasWng strategies „, in the extreme 

polymers having a length of le ^ , are 

5 synthesized, the numo + ^ + ^ ^ 

. „w of lithographic steps needed will 

The maximum number of lithog p 

each "layer" of monomers, i.e., 
generally be . for each 1 y ^ ^ lithographic 

number of musics (end, there . ^ ^ transparent mask 

size of synthesis areas = (A)/(S)

where A is the total area available for synthesis, and
S is the number of sequences desired in the area.
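
The relation between total area and region size can be illustrated numerically. This is a sketch only: the function name `region_size`, the assumption that regions are square, and the example values (1 square centimeter, all 8-mers of four nucleotides) are hypothetical and are not figures from the specification.

```python
import math


def region_size(total_area_cm2, n_sequences):
    """Return (area per region in cm^2, side of a square region in micrometers)."""
    area = total_area_cm2 / n_sequences   # size of synthesis areas = A / S
    side_um = math.sqrt(area) * 1e4       # 1 cm = 10,000 micrometers
    return area, side_um


area, side = region_size(1.0, 4 ** 8)
print(f"{area:.3e} cm^2 per region, {side:.1f} um on a side")
```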

. « „nl be appreciated b, J^^^ 
that the above method could , nsing 

produce thousands or mxllxon. of o « 

the photolithographic technrgues to pract ically 

conseguently^th. * ^tra.'penta. 

^ZT^ "dec:. even dodecanucleotides. 

or larger polynucleotides.

The above example has illustrated the method by way
of a manual example. It will of course be appreciated that
automated or semi-automated methods could be used. The
substrate would be mounted in a flow cell for automated addition
and removal of reagents, to minimize the volume of
reagents needed, and to more carefully control reaction
conditions. Successive masks could be applied manually or
automatically. See, e.g., PCT publication no. WO90/15070 and
7. labeling of target
The target oligonucleotide can be labeled using
standard procedures referred to above. As discussed, for
certain situations, a reagent which recognizes interaction,
e.g., ethidium bromide, may be provided in the detection step.
Alternatively, fluorescence labeling techniques may be applied,
see, e.g., Smith, et al. (1986) Nature, 321: 674-679; and
Prober, et al. (1987) Science, 238:336-341. The techniques
described therein will be followed with minimal modifications
as appropriate for the label selected.
8. dimers of A, C, G, and T
The described technique may be applied, with
photosensitive blocked nucleotides corresponding to adenine,
cytosine, guanine, and thymine, to make combinations of
polynucleotides consisting of each of the four different
nucleotides. All 16 possible dimers would be made using a
minor modification of the described method.
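
The 16 dimers can be listed explicitly. The following is a minimal sketch; the helper name `all_oligomers` is an assumption introduced for illustration, and the listing order carries no meaning for the layout on a substrate.

```python
from itertools import product

BASES = "ACGT"


def all_oligomers(length, bases=BASES):
    """Return every sequence of the given length over the given bases."""
    return ["".join(p) for p in product(bases, repeat=length)]


dimers = all_oligomers(2)
print(len(dimers))
print(" ".join(dimers))
```

The same call with a larger `length` enumerates longer oligonucleotides, at the cost of a list that grows fourfold with each added position.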

9. 10-mers of A, C, G, and T 
The described technique for making dimers of A, C, G, 

and T may be further extended to make longer oligonucleotides. 
The automated system described, e.g., in PCT publication no. 
WO90/15070, and U.S.S.N. 07/624,120, can be adapted to make all 
possible 10-mers composed of the 4 nucleotides A, C, G, and T. 
The photosensitive, blocked nucleotide analogues have been 
described above, and would be readily adaptable to longer 
oligonucleotides. 

10. specific recognition hybridization to 10-
mers

The described hybridization conditions are directly
applicable to the sequence specific recognition reagents
attached to the substrate, produced as described immediately
above. The 10-mers have an inherent property of hybridizing to
a complementary sequence. For optimum discrimination between
full matching and some mismatch, the conditions of
hybridization should be carefully selected, as described above.
Careful control of the conditions, and titration of parameters,
should be performed to determine the optimum collective
conditions.
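
The recognition step and the later overlap analysis can be pictured with an idealized model. This is a sketch under strong assumptions, not the procedure of the specification: hybridization is treated as all-or-nothing perfect complementarity, every subsequence of the target is assumed to occur once, and the names `positive_probes` and `assemble` are hypothetical. A short probe length is used so the example stays readable.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the strand that pairs with seq, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def positive_probes(target, k):
    """Probes of length k that are perfectly complementary to some window of target."""
    return {reverse_complement(target[i:i + k])
            for i in range(len(target) - k + 1)}


def assemble(probes):
    """Chain the recognized subsequences by their (k-1)-base overlaps."""
    kmers = {reverse_complement(p) for p in probes}
    suffixes = {m[1:] for m in kmers}
    starts = [m for m in kmers if m[:-1] not in suffixes]
    if len(starts) != 1:
        raise ValueError("no unique starting subsequence")
    sequence = starts[0]
    k = len(sequence)
    remaining = kmers - {sequence}
    while remaining:
        tail = sequence[-(k - 1):]
        candidates = [m for m in remaining if m[:-1] == tail]
        if len(candidates) != 1:
            raise ValueError("ambiguous or missing overlap")
        sequence += candidates[0][-1]
        remaining.remove(candidates[0])
    return sequence


hits = positive_probes("ACGTTGCA", 4)
print(sorted(hits))
print(assemble(hits))
```

Real data would include mismatched duplexes and repeated subsequences, which is why the text stresses titration of the hybridization conditions; the sketch raises an error rather than guessing when the overlaps are ambiguous.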

11. hybridization 
wvbridization conditions are described m detail, 
Hybndxzatx gui^c^^ 
o a in Hames and Higgms l^^; ±2 . 

£££E?£5S« .re described, e.g. . in »et»ur and^ 
above, conditions are aesir Qr 

::n ssr ."r^ — . - - the GC 

iiitions used in Southern blot transfers. Typically, the GC 
" :ust"ni»i,ed b, the introduction of appropriate 
concentrations of the alfcyla^oniuB buffers, as descried 

ab ° Ve " Titration of the temperature and other parameters is 
.nLine the opti»u» conditions for specificity and 
" rC^S^-U* «*— hybridisation fro. 

etched hybridation. ^ ^ _ ^ ^ ^ _ 

generated, as described in Prober, et al. ,».7] IBM 

e «,4-i, .tal (1986) fiataES 321:674-679. 
25 ""T'Z ta ^; or t^l are^a sa»e length as. or 

few Of the probes hybridise perfectly with the target, and 
30 which particular ones did wouldbe 

The substrate and probes are muu 

« X salt buffer which minimizes GC bras is preferred, 

incorporating, e.g., buffer, such as tetra*ethy ™» "e 
Methyl — i« ion at between about 2 . 4 an M. 
„ood, et al. (1985) BEflS ffi^J ^ "- 1585 . 1 
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long as about 1 * ghort matches wxi 

conditions where ■ . .• for 

+-v«e conditions ror 

aaratl ° n ' W on -.1-1 WMidi ^eters initially titrated «• 

r^-^rJK^ — — — 

hyW id an, -^T * seq nencin, 

will be preferred ., !c 

■ .1 detection o£ specific 
15 , positional dere.i. 

1 ' interaction specific 
M indicated above, ~~£T~ positions „bere 
fractions nay be perused » _ t „e labei is 

£1U ° teSCent no « o9 and «.....«• processes 

publication no. «» ' particular, tbe syn iIio 

Idvantegeoualy . „trix pattern °* 

deecribed above will resu ^ a „ a . BM , patter 

25 seances .« corresp "^f^ont 

— "Tn alternative embodiment, a eepa rat 

• H differentially interacts ntt tbe^P or does not 

reagent will men ethid iuro bromide, may 

intercalating dye - g , ^ interaction . 

indicate the positions 

35 analysis . se quence 

conversion of tbe ^'^ces vbose anaiysis 
specificity .ill Ptoviae tbe -> ^ desorlbed abo ve. 

by overlap segments, Bay 
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bv the methodology described above, or 
An alysis is ab le fro, the Genetic Engineering 

using, e.g., software Yugosla via (Yugoslav 

Cent ? P Se ; a so Macevicz, PC T P-lication no. WO ,0/04652, 
group), see, also, reference. 

5 „hi=h U hereby preparation of short peptides 

Tb e descr^oo o £ «,e ^ u . s . s . s . 

« substrate incorporates cy 
on a sub ^ r and deS cribed below. 

07/492,462 (VLSIPS UL.*-> , 

POLYNUCLEOTIDE FINGERPRINTING 
10 P0LYN " ^etion on generation of reagents for 

The above section on g f in gerprinting 

seguencing P-""^'^ « - applied 
applications. Fingerprinting ^ 

15 classification, cell an £orensio use s for individual 

Classifioation. end genetic 

rS^SS- Mapping aginations are also 

simiiar to trslC-sTouL therein. „npi~»,. 
presence of S P* C1 ^°, S ^ £„ g erprinting will be longer than 
the subsequences used for fin rp seguenc ing. m 
the sequences used in oligonu determine the 

25 particular, specific long segments nay be 

similarity of ^PeLf ic cognations of 

ttfon are provided therein. Particuiar probe sequences 
information are pro ltlowl man ner to a 

are selected and attached in a p us a 

30 substrate. The means for attachment > m 
caged biotin method -1^ ^ uslng tar getin, 

nr. —e":^:: :: « 

T^ontarv binding molecule may u« 
35 ^ r=r and^ ^ ^0^ 

"orbe^eferred, e.g., unnatural 
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optica! U-». -** ™" " ot » ith MtUral 

— -J— ^ . su >etrate »ith parti~lar 

5 regions, the substrate may * ^oreation es to whether 

the r^^-^rtlS. «- corresponding seguenee 
r ."^ This WUX cften provide information 
lilac to a southern blot hybridization. 

10 T ^ r „r- a 1 po-roi npment 

Developmental RNA expression patterns
The present fingerprinting invention also allows cell 
• I b v identification of developmental KNA 

ClaSSiflCa Tatterns For example, a lymphocyte ste* cell 
15 expression patterns- » ^ ion pf ^ spe cies. As the 

expresses a particular c developmental scheme, at 

Lymphocyte developsthrough pr - ^ ^ „ 

v ,rious ^ YerttS a r s^ee in development, hgain, the 
diagnostic of particular s g definition of 

2 „ Minting ^cd =1 ^ - - - ^ ^ 

specific structural features wn 

-. functional features whicn win 
developmental or al devel opmental classes, 

classification of cells into P ^ 

cells, products ~* °* «" — 
25 b e assayed to determine the deve P ^ 

cells, in this manner, once a devel p selected out 

. specific synch,^ 

30 development. allows for fingerprinting 

Th e present ^ ^ VcnlTfashion. the 
of the f ^ ^ de termina„t of developmental 
population, which should be a g tal £eatur es of the 

stage, will be correlate ^developmental stages 

35 oell. » this »nner cells enviroment , „ „ell 

„ili be characterized by the intraceliui 
as the extracellular environment. 




Diagnostic Tests
The present invention also provides the ability to
perform diagnostic tests. Diagnostic tests typically are based
upon a fingerprint type assay, which tests for the presence of
5 specific diagnostic polynucl -^^^ , £Ton, 

^iZSi^-TL- diagnostic test. 
"C^iS-Uy defined s P ec i£ ic oligonucleotide reagent.. 

The present invention provides reagents and

on a VLSIPS oligonucleotide substrate can identify the
presence of particular viral genomes.

Bacterial T J-n*if< ration 

. „,,, h . aoolicable to identifying 
sta ilar technigu es vi U 1 ^ 

20 a bacterial -^J^ source. of particular bacterial 
'"'Tee Jr e one bacterial assay nay be useful in 

determining the natural range of survivability of particular
strains of bacteria across regions of the country or
different ecological niches.

r.^ r ..rrnbir l- r-' ™enti fi cation s 
TOe ^entln^ntir^aesTeans for diagnosis of 

— -ek rrLTSs^'^rSt .1.0 

3 0 species and parasitic specif Hifferent 
w for assaying a combination of different 

Z « Iple a biological specinen Bay be assayed 
t : rr resence of 4 or all of ^obJLologl cal 

species, in hunan diagnostic uses, typica sanples 
35 blood, sputun, stool, urine, or other sanples. 
- ~^~~f£zr~ - — 

• ™rint and identify a genetic xndxv d above 

Genetio "™ erprl "" ng j"° hyb ridizatio n plots. 

Genetio ling.rpri.tmg ha. .!« ^^ Jsi£j£igBS e 34= WU 
„ a Morris et «1. <1 989 ' » ^TZicribed above, an 

lar ge number of probes. The li su£licie ntly large 

u ould te » »n i^nticai ^ negUgibl e. However rt 

„u*er of probes may be statr rf pro „ es b e used 

i. often guite important that larg ^ ^ „ be 

-? Terr 1 " raTs: u -xt^- 

of . homologies xw*. th a sequence *f 

„ashes. Then, eaob ^"^ination of vhioh greatly 
„ a a homology measurement, the stati .tical 
Ureases tbe nu^er- ^ ^ gen etio 

likelihood of a perfect p 



individual. 



30 



test alleles vith marKers^ abiuty ^ 

Ih e present ^"""/^tduals. For example, a 
sctee n for genetio variations of rndr ^ 
^ber of genetio diseases «e ^ ^^U^ 

C. et ai. \. c / 



35 



^ 0 f al (eds.) iii^- 

see e.g., Scriber, C. et • York . In one 

Xk^^^' MCGraW ;! s b Un correlated witn a specific 
t^^y^f- S1 %^; b ^ 347: 382-386. A nuaber 
g ene, see, Gregory et <£• ^ de f iciencxes. 



See, e.g 
Catalogs of Autosomal Dominant, Autosomal Recessive, and X-
linked Phenotypes, Johns Hopkins University Press, Baltimore;
Ott, J. (1985) Analysis of Human Genetic Linkage, Johns Hopkins
University Press, Baltimore; Track, R. et al. (1989) Banbury
Report 32: DNA Technology and Forensic Science, Cold Spring
Harbor Press, New York; each of which is hereby incorporated
herein by reference.

2. Amniocentesis 
Typically, amniocentesis is used to determine whether 
chromosome translocations have occurred. The mapping procedure 
may provide the means for determining whether these
translocations have occurred, and for detecting particular 
alleles of various markers. 

MAPPING 

Positionally Located Clones
The present invention allows for the positional
location of specific clones useful for mapping. For example,
caged biotin may be used for specifically positioning a probe
to a location on a matrix pattern.

In addition, the specific probes may be positionally
directed to specific locations on a substrate by targeting.
For example, polypeptide specific recognition reagents may be
attached to oligonucleotide sequences which can be
complementarily targeted, by hybridization, to specific
locations on a VLSIPS substrate. Hybridization conditions, as
applied for oligonucleotide probes, will be used to target the
reagents to locations on a substrate having complementary
oligonucleotides synthesized thereon. In another embodiment,
oligonucleotide probes may be attached to specific polypeptide
targeting reagents such as an antigen or antibody. These
reagents can be directed towards a complementary antigen or
antibody already attached to a VLSIPS substrate.
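
Targeting by complementary hybridization amounts to a lookup from each reagent's oligonucleotide tag to the substrate position carrying the complementary sequence. The sketch below is illustrative only: the function name `target_reagents`, the grid coordinates, the six-base tags, and the reagent names are all hypothetical, and pairing is modeled as exact complementarity.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the strand that pairs with seq, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def target_reagents(substrate, reagents):
    """Map each reagent to the position whose oligonucleotide pairs with its tag.

    substrate: dict of position -> oligonucleotide synthesized there
    reagents:  dict of reagent name -> oligonucleotide tag attached to it
    Reagents with no complementary position map to None.
    """
    address = {reverse_complement(oligo): pos for pos, oligo in substrate.items()}
    return {name: address.get(tag) for name, tag in reagents.items()}


substrate = {(0, 0): "ACGTAC", (0, 1): "TTGACA"}
reagents = {"reagent_1": "GTACGT", "reagent_2": "TGTCAA", "untagged": "AAAAAA"}
print(target_reagents(substrate, reagents))
```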

In another embodiment, an unnatural nucleotide which
does not interfere with natural nucleotide complementary
hybridization may be used to target oligonucleotides to
particular positions on a substrate. Unnatural optical isomers
of natural nucleotides should be ideal candidates.

In this way, short probes may be used to determine
the mapping of long targets or long targets may be used to map
the position of shorter probes. See, e.g., Craig et al. (1990)
Nuc. Acids Res. 18:2653-2660.

Positionally Defined Clones
Positionally defined clones may be transferred to a
new substrate by either physical transfer or by synthetic
means. Synthetic means may involve either a production of the

::re'onL ****** « 

may involve the attachment of a targeting sequence made by
VLSIPS synthetic methods which will target that positionally
defined clone to a position on a new substrate. Both methods
will provide a substrate having a number of positionally
defined probes useful in mapping.

CONCLUSION 

The present inventions provide greatly improved
methods and apparatus for synthesis of polymers.
It is to be understood that the above description is intended
to be illustrative and not restrictive. Many embodiments will
be apparent to those of skill in the art upon reviewing the
above description. By way of example, the invention has been
described primarily with reference to the use of photoremovable
protective groups, but it will be readily recognized by those
of skill in the art that sources of radiation other than light
could also be used. For example, in some embodiments it may be
desirable to use protective groups which are sensitive to
electron beam irradiation, x-ray irradiation, in combination
with electron beam lithography or x-ray lithography techniques.
Alternatively, the group could be removed by exposure to an
electric current. The scope of the invention should,
therefore, be determined not with reference to the above
description, but should instead be determined with reference to
the appended claims, along with the full scope of equivalents
to which such claims are entitled.
All publications and patent applications referred to
herein are incorporated by reference to the same extent as if
each individual publication or patent application was
specifically and individually incorporated by reference. The
present invention now being fully described, it will be
apparent to one of ordinary skill in the art that many changes
and modifications can be made thereto without departing from
the spirit or scope of the appended claims.
WHAT IS CLAIMED IS:

1. A composition comprising a plurality of
positionally distinguishable sequence specific reagents
attached to a solid substrate, which reagents are capable of
specifically binding to a predetermined subunit sequence of a
preselected multi-subunit length having at least five subunits,
said reagents representing substantially all possible sequences
of said preselected length.

2. A composition of Claim 1, wherein said subunit
sequence is a polynucleotide.

3. A composition of Claim 1, wherein said specific
reagent is a polynucleotide of at least about eight
nucleotides.

4. A composition of Claim 1, wherein said specific
reagents are all attached to a single solid substrate.

5. A composition of Claim 1, wherein said reagents
comprise about 3000 different sequences.

6. A composition of Claim 1, wherein said reagents
represent at least about 25% of the possible subsequences of
said preselected length.

7. A composition of Claim 1, wherein said reagents
are localized in regions of the substrate having a density of
at least 25 regions per square centimeter.

8. A composition of Claim 4, wherein said substrate
has a surface area of less than about 4 square centimeters.

9. A method of analyzing a sequence of a
polynucleotide, said method comprising the step of:

a) exposing said polynucleotide to a
composition of Claim 1.

10. A method of identifying or comparing a target
sequence with a reference, said method comprising the steps of:

a) exposing said target sequence to a
composition of Claim 1;

b) determining the pattern of positions of
said reagents which specifically interact
with said target sequence; and

c) comparing said pattern with the pattern
exhibited by said reference when exposed to
said composition.

11. A method for sequencing a segment of a
polynucleotide comprising the steps of:

a) combining:

i) a substrate comprising a plurality of
chemically synthesized and
positionally distinguishable
oligonucleotides capable of
recognizing defined oligonucleotide
sequences; and

ii) a target polynucleotide; thereby
forming high fidelity matched duplex
structures of complementary
subsequences of known sequence; and

b) determining which of said reagents have
specifically interacted with subsequences
in said target polynucleotide.

12. A method of Claim 11, wherein said segment is
substantially the entire length of said polynucleotide.

13. A method for sequencing a polymer, said method
comprising the steps of:

a) preparing a plurality of reagents which
each specifically bind to a subsequence of
preselected length;

b) positionally attaching each of said
reagents to one or more solid phase
substrates, thereby producing substrates of
positionally definable sequence specific
probes;

c) combining said substrates with a target
polymer whose sequence is to be determined;
and

d) determining which of said reagents have
specifically interacted with subsequences
in said target polymer.

14. A method of Claim 13, wherein said substrates
are beads.

15. A method of Claim 13, wherein said plurality of
reagents comprise substantially all possible subsequences of
said preselected length found in said target.

16. A method of Claim 13, wherein said solid phase
substrates are a single substrate having attached thereto
reagents recognizing substantially all possible subsequences of
preselected length found in said target.

17. A method of Claim 13, further comprising the
step of analyzing a plurality of said recognized subsequences
to assemble a sequence of said target polymer.

18. A method of Claim 14, wherein at least some of
said plurality of substrates have one subsequence specific
reagent attached thereto, and said substrates are coded to
indicate the specificity of said reagent.

19. A method of using a fluorescent nucleotide to
detect interactions with oligonucleotide probes of known
sequence, said method comprising:

a) attaching said nucleotide to a target
unknown polynucleotide sequence, and

b) exposing said target polynucleotide
sequence to a collection of positionally
defined oligonucleotide probes of known
sequences to determine the sequences of
said probes which interact with said
target.

20. A method of Claim 19, further comprising the 

step of: 

a) collating said known sequences to determine 
the overlaps of said known sequences to 
determine the sequence of said target 
sequence. 

21. A method of mapping a plurality of sequences
relative to one another, said method comprising: 

a) preparing a substrate having a plurality of 
positionally attached sequence specific 
probes are attached; 

b) exposing each of said sequences to said 
substrate, thereby determining the patterns 
of interaction between said sequence 
specific probes and said sequences; and 

c) determining the relative locations of said 
sequence specific probe interactions on 
said sequences to determine the overlaps 
and order of said sequences. 



22. A method of Claim 21, wherein said sequence 
specific probes are oligonucleotides. 

23. A method of Claim 21, wherein said sequences are
nucleic acid sequences.